Abstract We introduce a novel matching algorithm,
called DeepMatching, to compute dense correspondences
between images. DeepMatching relies on a hierarchical,
multi-layer, correlational architecture designed for
matching images and was inspired by deep convolutional
approaches. The proposed matching algorithm can handle
non-rigid deformations and repetitive textures and
efficiently determines dense correspondences in the
presence of significant changes between images.

We evaluate the performance of DeepMatching, in
comparison with state-of-the-art matching algorithms,
on the Mikolajczyk (Mikolajczyk et al 2005), the
MPI-Sintel (Butler et al 2012) and the Kitti (Geiger et al
2013) datasets. DeepMatching outperforms the
state-of-the-art algorithms and shows excellent results,
in particular for repetitive textures.

We also propose a method for estimating optical
flow, called DeepFlow, by integrating DeepMatching in
the large displacement optical flow (LDOF) approach
of Brox and Malik (2011). Compared to existing matching
algorithms, additional robustness to large displacements
and complex motion is obtained thanks to our
matching approach. DeepFlow obtains competitive
performance on public benchmarks for optical flow
estimation.

Keywords Non-rigid dense matching, optical flow. 

1 Introduction 

Computing correspondences between related images is
a central issue in many computer vision problems, ranging
from scene recognition to optical flow estimation
(Forsyth and Ponce 2011; Szeliski 2010). The goal of a
matching algorithm is to discover shared visual content
between two images, and to establish as many precise
point-wise correspondences, called matches, as possible.
An essential aspect of matching approaches is
the amount of rigidity they assume when computing the
correspondences. In fact, matching approaches range
between two extreme cases: stereo matching, where
matching hinges upon strong geometric constraints, and
matching “in the wild”, where the set of possible
transformations from the source image to the target one
is large and the problem is almost unconstrained.
Effective approaches have been designed for matching rigid
objects across images in the presence of large viewpoint
changes (Lowe 2004; Barnes et al 2010; HaCohen
et al 2011). However, the performance of current
state-of-the-art matching algorithms for images “in the
wild”, such as consecutive images in real-world videos
featuring fast non-rigid motion, still calls for
improvement (Xu et al 2012; Chen et al 2013). In this paper,
we aim at tackling matching in such a general setting.

Matching algorithms for images “in the wild” need
to accommodate several requirements that often turn
out to be contradictory. On one hand, matching objects
necessarily requires rigidity assumptions to some
extent. It is also mandatory that these objects have
sufficiently discriminative textures to make the problem
well-defined. On the other hand, many objects or
regions are not rigid, like humans or animals.
Furthermore, large portions of an image are usually
occupied by weakly textured or textureless regions, often
with repetitive textures, like sky or bucolic background.

Descriptor matching approaches, such as SIFT (Lowe
2004) or HOG (Dalal and Triggs 2005; Brox and Malik
2011) matching, compute discriminative feature
representations from rectangular patches. However, while
these approaches succeed in case of rigid motion, they
fail to match regions with weak or repetitive textures,
as local patches are poorly discriminative. Furthermore,
matches are usually poor and imprecise in case of
non-rigid deformations, as these approaches rely on rigid
patches. Discriminative power can be traded against
increased robustness to non-rigid deformations. Indeed,
propagation-based approaches, such as Generalized
PatchMatch (Barnes et al 2010) or Non-rigid Dense
Correspondences (HaCohen et al 2011), compute simple
feature representations from small patches and propagate
matches to neighboring patches. They yield good
performance in case of non-rigid deformations. However,
matching repetitive textures remains beyond the reach
of these approaches.

In this paper we propose a novel approach, called
DeepMatching, that gracefully combines the strengths
of these two families of approaches. DeepMatching is
computed using a multi-layer architecture, which breaks
down patches into a hierarchy of sub-patches. This
architecture makes it possible to work at several scales
and handle repetitive textures. Furthermore, within each
layer, local matches are computed assuming a restricted
set of feasible rigid deformations. Local matches are then
propagated up the hierarchy, which progressively
discards spurious matches. We call our approach
DeepMatching, as it is inspired by deep convolutional
approaches.

In summary, we make three contributions: 

• Dense matching: we propose a matching algorithm,
DeepMatching, that robustly determines dense
correspondences between two images. It explicitly
handles non-rigid deformations, with bounds on
the deformation tolerance, and incorporates a multi-scale
scoring of the matches, making it robust to repetitive
or weak textures. Furthermore, our approach is
based on gradient histograms, and thus robust to
appearance changes caused by illumination and color
variations.

• Fast, scale/rotation-invariant matching: we
propose a computationally efficient version of
DeepMatching, which performs almost as well as exact
DeepMatching, but at a much lower memory cost.
Furthermore, this fast version of DeepMatching can be
extended to a scale and rotation-invariant version, making
it an excellent competitor to state-of-the-art descriptor
matching approaches.

• Large-displacement optical flow: we propose
an optical flow approach which uses DeepMatching in
the matching term of the large displacement variational
energy minimization of Brox and Malik (2011). We show
that DeepMatching is a better choice than the
HOG descriptor used by Brox and Malik (2011) and
other state-of-the-art matching algorithms. The approach,
named DeepFlow, obtains competitive results on public
optical flow benchmarks.

This paper is organized as follows. After a review of
previous works (Section 2), we start by presenting the
proposed matching algorithm, DeepMatching, in
Section 3. Then, Section 4 describes several extensions of
DeepMatching. In particular, we propose an optical flow
approach, DeepFlow, in Section 4.3. Finally, we present
experimental results in Section 5.

A preliminary version of this article has appeared
in Weinzaepfel et al (2013). This version adds (1) an
in-depth presentation of DeepMatching; (2) an enhanced
version of DeepMatching, which improves the match
scoring and the selection of entry points for
backtracking; (3) proofs on time and memory complexity of
DeepMatching as well as its deformation tolerance; (4) a
discussion on the connection between Deep Convolutional
Neural Networks and DeepMatching; (5) a fast
approximate version of DeepMatching; (6) a scale and
rotation invariant version of DeepMatching; and (7) an
extensive experimental evaluation of DeepMatching on
several state-of-the-art benchmarks. The code for
DeepMatching as well as DeepFlow is available at
http://lear.inrialpes.fr/src/deepmatching/ and
http://lear.inrialpes.fr/src/deepflow/. Note that we
provide a GPU implementation in addition to the CPU
one.

2 Related work 

In this section we review related work on “general”
image matching, that is matching without prior
knowledge and constraints, and on matching in the context
of optical flow estimation, that is matching consecutive
images in videos.

2.1 General image matching 

Image matching based on local features has been
extensively studied in the past decade. It has been applied
successfully to various domains, such as wide baseline
stereo matching (Furukawa et al 2010) and image
retrieval (Philbin et al 2010). It consists of two steps, i.e.,
extracting local descriptors and matching them. Image
descriptors are extracted in rigid (generally square)
local frames at sparse invariant image locations
(Mikolajczyk et al 2005; Szeliski 2010). Matching then
amounts to a nearest neighbor search between descriptors,
followed by an optional geometric verification. Note that a
confidence value can be obtained by computing the
uniqueness of a match, i.e., by looking at the distance to its
nearest neighbors (Lowe 2004; Brox and Malik 2011).
While this class of techniques is well suited for
well-textured rigid objects, it fails to match non-rigid
objects and weakly textured regions.

In contrast, the proposed matching algorithm, called
DeepMatching, is inspired by non-rigid 2D warping and
deep convolutional networks (LeCun et al 1998a; Uchida
and Sakoe 1998; Keysers et al 2007). This family of
approaches explicitly models non-rigid deformations. We
employ a novel family of feasible warpings that enforces
neither monotonicity nor continuity constraints, in
contrast to traditional 2D warping (Uchida and Sakoe
1998; Keysers et al 2007). This makes the problem
computationally much less expensive.

It is also worthwhile to mention the similarity with
non-rigid matching approaches developed for a broad
range of applications. Ecker and Ullman (2009)
proposed a similar pipeline to ours (albeit more complex)
to measure the similarity of small images. However,
their method lacks a way of merging correspondences
belonging to objects with contradictory motions, e.g.,
on different focal planes. For the purpose of
establishing dense correspondences between images, Wills et al
(2006) estimated a non-rigid matching by robustly
fitting smooth parametric models (homography and splines)
to local descriptor matches. In contrast, our approach
is non-parametric and model-free.

Recently, fast algorithms for dense patch matching
have taken advantage of the redundancy between
overlapping patches (Barnes et al 2010; Korman and
Avidan 2011; Sun 2012; Yang et al 2014). The insight is
to propagate good matches to their neighborhood in
a loose fashion, yielding dense non-rigid matches. In
practice, however, the lack of a smoothness constraint
leads to highly discontinuous matches. Several works
have proposed ways to fix this. HaCohen et al (2011)
reinforce neighboring matches using an iterative
multi-scale expansion and contraction strategy, performed in
a coarse-to-fine manner. Yang et al (2014) include a
guided filtering stage on top of PatchMatch, which
obtains smooth correspondence fields by locally
approximating an MRF. Finally, Kim et al (2013) propose a
hierarchical matching to obtain dense correspondences,
using a coarse-to-fine (top-down) strategy. Loopy belief
propagation is used to perform inference.

In contrast to these approaches, DeepMatching
proceeds bottom-up and, then, top-down. Due to its
hierarchical nature, DeepMatching is able to consider patches
at several scales, thus overcoming the lack of
distinctiveness that affects small patches. Yet, the multi-layer
construction makes it possible to perform matching
efficiently while allowing semi-rigid local deformations.
In addition, DeepMatching can be computed efficiently,
and can be further accelerated to satisfy low-memory
requirements with negligible loss in accuracy.


2.2 Matching for flow estimation 

Variational energy minimization is currently the most
popular framework for optical flow estimation. Since
the pioneering work of Horn and Schunck (1981),
research has focused on alleviating the drawbacks of this
approach. A series of improvements were proposed over
the years (Black and Anandan 1996; Werlberger et al
2009; Bruhn et al 2005; Papenberg et al 2006; Baker
et al 2011; Sun et al 2014b; Vogel et al 2013a). The
variational approach of Brox et al (2004) combines most of
these improvements in a unified framework. The energy
decomposes into several terms, namely the data-fitting
and the smoothness terms. Energy minimization is
performed by solving the Euler-Lagrange equations,
reducing the problem to solving a sequence of large and
structured linear systems.

More recently, the addition of a descriptor matching
term in the energy to be minimized was proposed
by Brox and Malik (2011). Following this idea, several
papers (Tola et al 2008; Brox and Malik 2011; Liu et al
2011; Hassner et al 2012) show that dense descriptor
matching improves performance. Strategies such as
reciprocal nearest-neighbor verification (Brox and Malik
2011) make it possible to prune most of the false matches.
However, a variational energy minimization approach that
includes such a descriptor matching term may fail at
locations where matches are missing or wrong.

Related approaches tackle the problem of dense scene
correspondence. SIFT-flow (Liu et al 2011), one of the
most famous methods in this context, also formulates the
matching problem in a variational framework. Hassner
et al (2012) improve over SIFT-flow by using multi-scale
patches. However, this decreases performance in
cases where scale invariance is not required. Xu et al
(2012) integrate matching of SIFT (Lowe 2004) and
PatchMatch (Barnes et al 2010) to refine the flow
initialization at each level. Excellent results are obtained
for optical flow estimation, yet at the cost of expensive
fusion steps. Leordeanu et al (2013) extend sparse
matches with locally affine constraints to dense matches
and, then, use a total variation algorithm to refine the
flow estimation. We present here a computationally
efficient and competitive approach for large displacement
optical flow by integrating the proposed DeepMatching
algorithm into the approach of Brox and Malik (2011).

Fig. 1 Illustration of moving quadrant similarity: a quadrant
is a quarter of a SIFT patch, i.e. a group of 2 x 2 cells. Left:
SIFT descriptor in the first image. Middle: second image with
optimal standard SIFT matching (rigid). Right: second image
with optimal moving quadrant SIFT matching. In this example,
the patch covers various objects moving in different directions:
for instance the car moves to the right while the cloud moves to
the left. Rigid matching fails to capture this, whereas the moving
quadrant approach is able to follow each object.


3 DeepMatching 

This section introduces our matching algorithm
DeepMatching. DeepMatching is a matching algorithm based
on correlations at the patch-level, that proceeds in a
multi-layer fashion. The multi-layer architecture relies
on a quadtree-like patch subdivision scheme, with an
extra degree of freedom to locally re-optimize the
positions of each quadrant. In order to enhance the
contrast of the spatial correlation maps output by the local
correlations, a nonlinear transformation is applied after
each layer.

We first give an overview of DeepMatching in
Section 3.1 and show that it can be decomposed into a
bottom-up pass followed by a top-down pass. We then present
the bottom-up pass in Section 3.2 and the top-down
one in Section 3.3. Finally, we analyze DeepMatching
in Section 3.4.


3.1 Overview of the approach 

A state-of-the-art approach for matching regions
between two images is based on the SIFT descriptor (Lowe
2004). SIFT is a histogram of gradients with $4 \times 4$
spatial and 8 orientation bins, yielding a robust descriptor
$R \in \mathbb{R}^{4 \times 4 \times 8}$ that effectively
encodes a square image region. Note that its $4 \times 4$ cell
grid can also be viewed as 4 so-called “quadrants” of
$2 \times 2$ cells, see Figure 1. We can, then, rewrite
$R = [R_0, R_1, R_2, R_3]$ with $R_i \in \mathbb{R}^{2 \times 2 \times 8}$.

Let $R$ and $R'$ be the SIFT descriptors of the
corresponding regions in the source and target image. In
order to remove the effect of non-rigid motion, we
propose to optimize the positions $p_i \in \mathbb{R}^2$ of the
4 quadrants of the target descriptor $R'$ (rather than keeping
them fixed), in order to maximize

$$ \mathrm{sim}(R, R') = \max_{\{p_i\}} \frac{1}{4} \sum_{i=0}^{3} \mathrm{sim}\left(R_i, R'_i(p_i)\right), \tag{1} $$



Fig. 2 Left: Quadtree-like patch hierarchy in the first image. 
Right: one possible displacement of corresponding patches in the 
second image. 


where $R'_i(p_i)$ is the descriptor of a single quadrant
extracted at position $p_i$ and sim() a similarity function.
Now, sim(R, R') is able to handle situations such as
the one presented in Figure 1, where a region contains
multiple objects moving in different directions.
Furthermore, if the four quadrants can move independently (of
course, within some extent), it can be calculated more
efficiently as:

$$ \mathrm{sim}(R, R') = \frac{1}{4} \sum_{i=0}^{3} \max_{p_i} \mathrm{sim}\left(R_i, R'_i(p_i)\right). \tag{2} $$

When applied recursively to each quadrant by subdividing
it into 4 sub-quadrants until a minimum patch
size is reached (atomic patches), this strategy allows
for accurate non-rigid matching. Such a recursive
decomposition can be represented as a quad-tree, see
Figure 2. Given an initial pair of two matching regions,
retrieving atomic patch correspondences is then done in
a top-down fashion (i.e. by recursively applying Eq. (2)
to the quadrant's positions $\{p_i\}$).
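
The difference between a rigid patch and Eq. (2) can be illustrated with a minimal Python sketch. The function names, the representation of candidate displacements as dictionary keys and the toy scores are all assumptions made for this example; they are not taken from the DeepMatching code.

```python
def rigid_sim(quadrant_scores):
    """Best score when all four quadrants share one displacement (rigid patch)."""
    shared = set(quadrant_scores[0])
    for scores in quadrant_scores[1:]:
        shared &= set(scores)
    return max(sum(s[d] for s in quadrant_scores) / 4.0 for d in shared)


def moving_quadrant_sim(quadrant_scores):
    """Eq. (2): average of the per-quadrant maxima (independent displacements)."""
    return sum(max(s.values()) for s in quadrant_scores) / 4.0


# Toy scores (hypothetical): quadrants 0-1 follow an object moving right,
# quadrants 2-3 follow an object moving left.
right, left = (1, 0), (-1, 0)
scores = [
    {right: 0.9, left: 0.1},
    {right: 0.8, left: 0.2},
    {right: 0.1, left: 0.9},
    {right: 0.2, left: 0.8},
]
rigid = rigid_sim(scores)
moving = moving_quadrant_sim(scores)
```

In this toy case no single displacement explains all four quadrants, so the rigid score stays low while the moving-quadrant score remains high, which is the situation depicted in Figure 1.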

Nevertheless, in order to first determine the set of
matching regions between the two images, we need to
compute beforehand the matching scores (i.e. similarity)
of all large-enough patches in the two images (as in
Figure 1), and keep the pairs with maximum similarity.
As indicated by Eq. (2), the score is formed by averaging
the max-pooled scores of the quadrants. Hence, the
process of computing the matching scores is bottom-up.
In the following, we call correlation map the matching
scores of a single patch from the first image at every
position in the second image. Selecting matching patches
then corresponds to finding local maxima in the
correlation maps.

To sum up, the algorithm can be decomposed into two
steps: (i) first, correlation maps are computed using a
bottom-up algorithm, as shown in Figure 6. Correlation
maps of small patches are first computed and then
aggregated to form correlation maps of larger patches;
(ii) next, a top-down method estimates the motion of
atomic patches starting from matches of large patches.

In the remainder of this section, we detail the two
steps described above (Section 3.2 and Section 3.3),
before analyzing the properties of DeepMatching in
Section 3.4.


3.2 Bottom-up correlation pyramid computation 

Let $I$ and $I'$ be two images of resolution $W \times H$ and
$W' \times H'$.




[Figure 3 panels: non-overlapping atomic patches $\{I_{4,p}\}_{p \in \mathcal{G}_4}$; correlations; level 1 correlation maps]

Fig. 3 Computing the bottom level correlation maps
$\{C_{4,p}\}_{p \in \mathcal{G}_4}$. Given two images $I$ and $I'$, the first one is
split into non-overlapping atomic patches of size 4x4 pixels.
For each patch, we compute the correlation at every location of
$I'$ to obtain the corresponding correlation map.


Bottom level. We use patches of size $4 \times 4$ pixels as
atomic patches. We split $I$ into non-overlapping atomic
patches, and compute the correlation map with image
$I'$ for each of them, see Figure 3. The score between
two atomic patches $R$ and $R'$ is defined as the average
pixel-wise similarity:

$$ \mathrm{sim}(R, R') = \frac{1}{16} \sum_{i=0}^{3} \sum_{j=0}^{3} R_{i,j}^{\top} R'_{i,j}, \tag{3} $$

where each pixel $R_{i,j}$ is represented as a histogram of
oriented gradients pooled over a local neighborhood.
We detail below how the pixel descriptor is computed.

Pixel descriptor $R_{i,j}$: We rely on a robust pixel
representation that is similar in spirit to SIFT and DAISY
(Lowe 2004; Tola et al 2010). Given an input image
$I$, we first apply a Gaussian smoothing of radius $\nu_1$ in
order to denoise $I$ from potential artifacts caused for
example by JPEG compression. We then extract the
gradient $(\delta x, \delta y)$ at each pixel and compute its
non-negative projection onto 8 orientations
$\{(\cos i\frac{\pi}{4}, \sin i\frac{\pi}{4})\}_{i=1 \ldots 8}$.
At this point, we obtain 8 oriented gradient maps. We
smooth each map with a Gaussian filter of radius $\nu_2$.
Next we cap strong gradients using a sigmoid
$x \mapsto 2/(1 + \exp(-\varsigma x)) - 1$, to help cancel out
effects of varying illumination. We smooth gradients one more
time for each orientation with a Gaussian filter of
radius $\nu_3$. Finally, the descriptor for each pixel is obtained
by the $L_2$-normalized concatenation of 8 oriented
gradients and a ninth small constant value $\mu$. Appending $\mu$
amounts to adding a regularizer that will reduce the
importance of small gradients (i.e. noise) and ensures that
two pixels lying in areas without gradient information
will still correlate positively. Pixel descriptors $R_{i,j}$ are
compared using dot-product and the similarity function
takes value in the interval [0, 1]. In Section 5.2.1,
we evaluate the impact of the parameters of this pixel
descriptor.
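
To make the role of the sigmoid and of the ninth constant entry concrete, here is a minimal per-pixel Python sketch. It is a simplification under stated assumptions: the three Gaussian smoothing steps are omitted, and the function name, the parameter names and the values `sigmoid_slope=0.2` and `mu=0.3` are illustrative choices, not the tuned parameters evaluated in Section 5.2.1.

```python
import math


def pixel_descriptor(gx, gy, sigmoid_slope=0.2, mu=0.3):
    """Descriptor of one pixel from its gradient (gx, gy).

    Gaussian smoothing is omitted; sigmoid_slope and mu are
    illustrative values, not the parameters used in the paper.
    """
    desc = []
    for i in range(8):
        angle = i * math.pi / 4.0
        # Non-negative projection of the gradient onto orientation i.
        proj = max(0.0, gx * math.cos(angle) + gy * math.sin(angle))
        # Sigmoid cap: maps [0, inf) to [0, 1) and saturates strong gradients.
        desc.append(2.0 / (1.0 + math.exp(-sigmoid_slope * proj)) - 1.0)
    desc.append(mu)  # ninth constant entry acting as a regularizer
    norm = math.sqrt(sum(v * v for v in desc))
    return [v / norm for v in desc]


def dot(a, b):
    """Dot product used to compare two pixel descriptors."""
    return sum(x * y for x, y in zip(a, b))


flat_a = pixel_descriptor(0.0, 0.0)   # pixel without gradient information
flat_b = pixel_descriptor(0.0, 0.0)
edge_x = pixel_descriptor(50.0, 0.0)  # strong horizontal gradient
edge_y = pixel_descriptor(0.0, 50.0)  # strong vertical gradient
```

Because every entry is non-negative and the vector has unit norm, the dot product of two descriptors stays in [0, 1]; two gradient-free pixels correlate fully thanks to the constant entry alone.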


Bottom-level correlation map: We can express the
correlation map computation obtained from Eq. (3) more
conveniently in a convolutional framework. Let $I_{N,p}$ be
a patch of size $N \times N$ from the first image centered at
$p$ ($N \geq 4$ is a power of 2). Let
$\mathcal{G}_4 = \{2, 6, 10, \ldots, W-2\} \times \{2, 6, 10, \ldots, H-2\}$
be a grid with step 4 pixels. $\mathcal{G}_4$
is the set of the centers of the atomic patches. For each
$p \in \mathcal{G}_4$, we convolve the flipped patch $I^{F}_{4,p}$ over $I'$

$$ C_{4,p} = I^{F}_{4,p} \star I', \tag{4} $$

to get the correlation map $C_{4,p}$, where $.^{F}$ denotes a
horizontal and vertical flip$^1$. For any pixel $p'$ of $I'$,
$C_{4,p}(p')$ is a measure of similarity between $I_{4,p}$ and
$I'_{4,p'}$. Examples of such correlation maps are shown in
Figure 3 and Figure 4. Unsurprisingly, we can observe
that atomic patches are not discriminative. Recursive
aggregation of patches in subsequent stages will be the
key to create discriminative responses.
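
The bottom-level correlation map of Eqs. (3) and (4) can be sketched in plain Python as a direct cross-correlation. This is an illustrative sketch only: the toy 2-D descriptor, the synthetic image pattern and the choice to leave positions where the window does not fit at 0 are assumptions of the example, and a practical implementation would use an optimized convolution instead of nested loops.

```python
import math


def descriptor(v):
    """Toy unit-norm, non-negative 2-D descriptor for a pixel value v in [0, 1]."""
    return (v, math.sqrt(1.0 - v * v))


def correlation_map(desc1, desc2, center, size=4):
    """Average pixel-wise dot product between the patch of desc1 centered at
    `center` and every same-sized window of desc2 (Eqs. (3) and (4))."""
    cx, cy = center
    half = size // 2
    h2, w2 = len(desc2), len(desc2[0])
    cmap = [[0.0] * w2 for _ in range(h2)]
    # Only positions where the whole window fits inside the second image.
    for y in range(half, h2 - half + 1):
        for x in range(half, w2 - half + 1):
            total = 0.0
            for j in range(-half, half):
                for i in range(-half, half):
                    a = desc1[cy + j][cx + i]
                    b = desc2[y + j][x + i]
                    total += a[0] * b[0] + a[1] * b[1]
            cmap[y][x] = total / (size * size)
    return cmap


W = H = 8
# Synthetic first image with a non-repeating pattern (hypothetical data).
image1 = [[((3 * x + 5 * y * y + x * y) % 7) / 7.0 for x in range(W)] for y in range(H)]
# Second image: the first one translated 2 pixels to the right.
image2 = [[row[x - 2] if x >= 2 else 0.0 for x in range(W)] for row in image1]
desc1 = [[descriptor(v) for v in row] for row in image1]
desc2 = [[descriptor(v) for v in row] for row in image2]

cmap = correlation_map(desc1, desc2, center=(2, 2))
best = max(((x, y) for y in range(H) for x in range(W)), key=lambda p: cmap[p[1]][p[0]])
```

The atomic patch centered at (2, 2) in the first image peaks at (4, 2) in the second one, i.e. at its position after the 2-pixel translation.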


Iteration. We then compute the correlation maps of
larger patches by aggregating those of smaller patches.
As shown in Figure 5, an $N \times N$ patch $I_{N,p}$ is the
concatenation of 4 patches of size $N/2 \times N/2$:

$$ I_{N,p} = \left[ I_{\frac{N}{2},\, p + \frac{N}{4} o_i} \right]_{i=0 \ldots 3}
\quad \text{with} \quad
\begin{cases}
o_0 = [-1,-1]^{\top}, \\
o_1 = [-1,+1]^{\top}, \\
o_2 = [+1,-1]^{\top}, \\
o_3 = [+1,+1]^{\top}.
\end{cases} \tag{5} $$

They correspond respectively to the bottom-left, top-left,
bottom-right and top-right quadrants. The correlation
map of $I_{N,p}$ can thus be computed using its
children's correlation maps. For the sake of clarity, we
define the short-hand notation $s_{N,i} = \frac{N}{4} o_i$ describing the
positional shift of a child patch $i \in [0, 3]$ relative
to its parent patch (see Figure 5).
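
As a small worked example of Eq. (5), the following Python sketch lists the centers of the four quadrants of a patch. The function name is hypothetical, and $o_1 = [-1,+1]^{\top}$ is taken as the top-left offset, consistent with the quadrant order given above.

```python
# Offsets o_0 .. o_3: bottom-left, top-left, bottom-right, top-right.
OFFSETS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def child_centers(p, n):
    """Centers p + (N/4) * o_i of the four N/2 x N/2 quadrants of the
    N x N patch centered at p."""
    shift = n // 4
    return [(p[0] + shift * ox, p[1] + shift * oy) for ox, oy in OFFSETS]


children_8 = child_centers((4, 4), 8)
children_16 = child_centers((8, 8), 16)
```

For an 8x8 patch centered at (4, 4), the children are the atomic patches centered at (2, 2), (2, 6), (6, 2) and (6, 6), which all lie on the grid of atomic patch centers with step 4.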

Using the above notations, we rewrite Eq. (2) by
replacing sim(R, R') with $C_{N,p}(p')$ (i.e. assuming here
that patch $R = I_{N,p}$ and that $R'$ is centered at $p' \in I'$).

$^1$ This amounts to the cross-correlation of the patch and $I'$.

Fig. 4 Correlation maps for patches of different sizes. Middle-left:
correlation map of a 4x4 patch. Bottom-right: correlation map of
a 16x16 patch obtained by aggregating correlation responses of
children 8x8 patches (bottom-left), themselves obtained from 4x4
patches. The map of the 16x16 patch is clearly more discriminative
than previous ones despite the change in appearance of the
region.



Fig. 5 A patch $I_{N,p}$ from the first image (blue box) and one of
its 4 quadrants $I_{\frac{N}{2},\, p+\frac{N}{4}o_i}$ (red box).

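Similarly, we replace the similarity between children
patches $\mathrm{sim}(R_i, R'_i(p'_i))$ by $C_{\frac{N}{2},\, p+s_{N,i}}(p'_i)$. For each child,
we retain the maximum similarity over a small
neighborhood $\Theta_i$ of width and height $\frac{N}{8}$ centered at $p' + s_{N,i}$.
We then obtain:

$$ C_{N,p}(p') = \frac{1}{4} \sum_{i=0}^{3} \max_{m' \in \Theta_i} C_{\frac{N}{2},\, p+s_{N,i}}(m'). \tag{6} $$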


We now explain how we can break down Eq. (6) into
a succession of simple operations. First, let us assume
that $N = 4 \times 2^{\ell}$, where $\ell \geq 1$ is the current iteration.
During iteration $\ell$, we want to compute the correlation
maps $C_{N,p}$ of every patch $I_{N,p}$ from the first image for
which correlation maps of its children have been
computed in the previous iteration. Formally, the set of
positions $\mathcal{G}_N$ of such patches is defined from the
positions $\mathcal{G}_{\frac{N}{2}}$ of the children patches
according to Eq. (5):

$$ \mathcal{G}_N = \left\{ p \mid p + s_{N,i} \in [0, W-1] \times [0, H-1] \,\wedge\, p + s_{N,i} \in \mathcal{G}_{\frac{N}{2}},\; i = 0, \ldots, 3 \right\}. \tag{7} $$

We observe that the larger a patch is (i.e. after several
iterations), the smaller the spatial variation of its
correlation map (see Figure 4). This is due to the statistics
of natural images, in which low frequencies significantly
dominate over high frequencies. As a consequence, we
choose to subsample each map $C_{N,p}$ by a factor 2. We
express this with an operator $\mathcal{S}$:

$$ \mathcal{S} : C(p') \to C(2p'). \tag{8} $$

The subsampling reduces by 4 the area of the
correlation maps and, as a direct consequence, the
computational requirements. Instead of computing the
subsampling on top of Eq. (6), it is actually more efficient to
propagate it towards the children maps and perform
it jointly with max-pooling. It also makes the
max-pooling domain $\Theta_i$ become independent from $N$ in the
subsampled maps, as it exactly cancels out the effect of
doubling $N = 4 \times 2^{\ell}$ at each iteration. We call $\mathcal{P}$ the
max-pooling operator with the iteration-independent
domain $\Theta = \{-1, 0, 1\} \times \{-1, 0, 1\}$:

$$ \mathcal{P} : C(p') \to \max_{m \in \Theta} C(p' + m). \tag{9} $$
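
The following Python sketch illustrates why max-pooling and subsampling can be performed jointly: evaluating the 3x3 maximum only at even positions gives the same result as pooling everywhere and then discarding three values out of four. The function names, the toy map and the choice to ignore out-of-range neighbors at the border are assumptions of this example.

```python
def max_pool(cmap):
    """P: 3x3 max-pooling (domain {-1,0,1}^2), ignoring out-of-range neighbors."""
    h, w = len(cmap), len(cmap[0])
    return [[max(cmap[y + dy][x + dx]
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if 0 <= y + dy < h and 0 <= x + dx < w)
             for x in range(w)] for y in range(h)]


def subsample(cmap):
    """S: keep every second row and column, C(p') -> C(2p')."""
    return [row[::2] for row in cmap[::2]]


def pool_and_subsample(cmap):
    """Fused S o P: evaluate the 3x3 maximum at even positions only."""
    h, w = len(cmap), len(cmap[0])
    return [[max(cmap[y + dy][x + dx]
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if 0 <= y + dy < h and 0 <= x + dx < w)
             for x in range(0, w, 2)] for y in range(0, h, 2)]


# Toy 5 x 6 correlation map (hypothetical values).
toy = [[((7 * x + 3 * y) % 11) / 10.0 for x in range(6)] for y in range(5)]
two_step = subsample(max_pool(toy))
fused = pool_and_subsample(toy)
```

Both routes return the same 3x3 map, but the fused version computes only a quarter of the maxima.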

For the same reason, the shift $s_{N,i} = \frac{N}{4} o_i = 2^{\ell} o_i$
applied to the correlation maps in $\Theta_i$'s definition becomes
simply $o_i$ after subsampling. Let $\mathcal{T}_t$ be the shift (or
translation) operator on the correlation map:

$$ \mathcal{T}_t : C(p') \to C(p' - t). \tag{10} $$

Finally, we incorporate an additional non-linear
mapping at each iteration on top of Eq. (6) by applying a
power transform $\mathcal{R}_\lambda$ (Malik and Perona 1990; LeCun
et al 1998a):

$$ \mathcal{R}_\lambda : C(\cdot) \to C(\cdot)^{\lambda}. \tag{11} $$

This step, commonly referred to as rectification, is added
in order to better propagate high correlations after each
level, or, in other words, to counterbalance the fact that
max-pooling tends to retain only high scores. Indeed,
its effect is to decrease the correlation values (which are
in [0, 1]) as we use $\lambda > 1$. Such post-processing is
commonly used in deep convolutional networks (LeCun et al
1998b; Bengio 2009). In practice, good performance is
obtained with $\lambda \simeq 1.4$, see Section 5. The final
expression of Eq. (6) is:

$$ C_{N,p} = \mathcal{R}_\lambda \left( \frac{1}{4} \sum_{i=0}^{3} \left( \mathcal{T}_{o_i} \circ \mathcal{S} \circ \mathcal{P} \right) \left( C_{\frac{N}{2},\, p+s_{N,i}} \right) \right). \tag{12} $$
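
The shifted average and the rectification of Eq. (12) can be sketched as follows, assuming the four child maps have already been max-pooled and subsampled. This sketch reads the shift so that the parent response at $p'$ uses child $i$ at $p' + o_i$, which matches Eq. (6) and the backtracking rule of Eq. (13); that reading, the function names and the toy maps are assumptions of the example. Children falling outside the map are skipped and the average is taken over the valid ones only.

```python
# Offsets o_0 .. o_3 from Eq. (5).
OFFSETS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def aggregate(children, lam=1.4):
    """Parent map from four pooled and subsampled child maps.

    The parent response at p' averages child i read at p' + o_i (valid
    children only), then applies the power rectification x -> x ** lam.
    """
    h, w = len(children[0]), len(children[0][0])
    parent = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            values = [child[y + oy][x + ox]
                      for child, (ox, oy) in zip(children, OFFSETS)
                      if 0 <= y + oy < h and 0 <= x + ox < w]
            if values:
                parent[y][x] = (sum(values) / len(values)) ** lam
    return parent


def peak_map(px, py, size=3, low=0.2):
    """Toy child map: 1.0 at (px, py), `low` elsewhere (hypothetical data)."""
    return [[1.0 if (x, y) == (px, py) else low for x in range(size)]
            for y in range(size)]


# Each child peaks exactly where a parent centered at (1, 1) expects it.
children = [peak_map(1 + ox, 1 + oy) for ox, oy in OFFSETS]
parent = aggregate(children)
```

The parent map keeps a response of 1 at (1, 1), while every other position drops below the background level 0.2 of the child maps, which illustrates how rectification with $\lambda > 1$ lowers weak correlations.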

[Figure 6 panels: 4x4 atomic patches; level 1; level 2; level 3]

Fig. 6 Computing the multi-level correlation pyramid. Starting with the bottom-level correlation maps, see Figure 3, they are
iteratively aggregated to obtain the upper levels. Aggregation consists of max-pooling, subsampling, computing a shifted average and
non-linear rectification.


Figure 6 illustrates the computation of correlation
maps for different patch sizes and Algorithm 1
summarizes our approach. The resulting set of correlation
maps across iterations is referred to as the multi-level
correlation pyramid.

Boundary effects: In practice, a patch $I_{N,p}$ can overlap
with the image boundary, as long as its center $p$ remains
inside the image (from Eq. (7)). For instance, a patch
$I_{N,p_0}$ with center at $p_0 = (0, 0) \in \mathcal{G}_N$ has only a single
valid child (the one for which $i = 3$, as $p_0 + s_{N,3} \in I$).
In such degenerate cases, the average sum in Eq. (12)
is carried out on valid children only. For $I_{N,p_0}$, it thus
only comprises one term weighted by 1 instead of $\frac{1}{4}$.

Note that Eq. (12) implicitly defines the set of
possible displacements of the approach, see Figures 2 and
9. Given the position of a parent patch, each child patch
can move only within a small extent, equal to the
quarter of its own size. Figure 4 shows the correlation maps
for patches of size 4, 8 and 16. Clearly, correlation maps
for larger patches are more and more discriminative, while
still allowing non-rigid matching.

3.3 Top-down correspondence extraction 

A score $S = C_{N,p}(p')$ in the multi-level correlation
pyramid represents the deformation-tolerant similarity
of two patches $I_{N,p}$ and $I'_{N,p'}$ $^2$. Since this score is built
from the similarity of 4 matching sub-patches at the
lower pyramid level, we can thus recursively backtrack a
set of correspondences to the bottom level (corresponding
to matches of atomic patches). In this section, we
first describe this backtracking. We, then, present the
procedure for merging atomic correspondences
backtracked from different entry points in the multi-level
pyramid, which constitute the final output of
DeepMatching.

Algorithm 1 Computing the multi-level correlation
pyramid.

Input: Images $I$, $I'$
For $p \in \mathcal{G}_4$ do
    $C_{4,p} = I^{F}_{4,p} \star I'$ (convolution, Eq. (4))
    $C_{4,p} \leftarrow \mathcal{R}_\lambda(C_{4,p})$ (rectification, Eq. (11))
$N \leftarrow 4$
While $N < \max(W, H)$ do
    For $p \in \mathcal{G}_N$ do
        $C'_{N,p} \leftarrow (\mathcal{S} \circ \mathcal{P})(C_{N,p})$ (max-pooling and subsampling)
    $N \leftarrow 2N$
    For $p \in \mathcal{G}_N$ do
        $C_{N,p} = \frac{1}{4} \sum_{i=0}^{3} \mathcal{T}_{o_i}\left(C'_{\frac{N}{2},\, p+s_{N,i}}\right)$ (shift and average)
        $C_{N,p} \leftarrow \mathcal{R}_\lambda(C_{N,p})$ (rectification, Eq. (11))
Return the multi-level correlation pyramid $\{C_{N,p}\}_{N,p}$

Compared to our initial version of DeepMatching
(Weinzaepfel et al 2013), we have updated match scoring
and entry point selection to optimize the execution
time and the matching accuracy. A quantitative
comparison is provided in Section 5.2.2.

Backtracking atomic correspondences. Given an entry
point $C_{N,p}(p')$ in the pyramid (i.e. a match between
two patches $I_{N,p}$ and $I'_{N,p'}$), we retrieve atomic
correspondences by successively undoing the steps used to
aggregate correlation maps during the pyramid
construction, see Figure 7. The entry patch $I_{N,p}$ is itself
composed of four moving quadrants $I^{i}_{N,p}$, $i = 0 \ldots 3$.
Due to the subsampling, the quadrant $I^{i}_{N,p} = I_{\frac{N}{2},\, p+s_{N,i}}$
matches with $I'_{\frac{N}{2},\, 2(p'+o_i)+m_i}$ where

$$ m_i = \arg\max_{m \in \{-1,0,1\}^2} C_{\frac{N}{2},\, p+s_{N,i}}\left(2(p' + o_i) + m\right). \tag{13} $$

$^2$ Note that $I'_{N,p'}$ only roughly corresponds to an $N \times N$ square
patch centered at $2^{\ell} p'$ in $I'$, due to subsampling and possible
deformations.

For the sake of clarity, we define the short-hand
notations $p_i = p + s_{N,i}$ and $p'_i = 2(p' + o_i) + m_i$. Let
$B$ be the function that assigns to a tuple $(N, p, p', s)$,
representing a correspondence between pixel $p$ and $p'$
for patch of size $N$ with a score $s \in \mathbb{R}$, the set of the
correspondences of children patches:

$$ B(N, p, p', s) =
\begin{cases}
\{(p, p', s)\} & \text{if } N = 4, \\
\bigcup_{i=0}^{3} \left\{ \left( \tfrac{N}{2},\, p_i,\, p'_i,\, s + C_{\frac{N}{2},\, p_i}(p'_i) \right) \right\} & \text{else.}
\end{cases} \tag{14} $$

Given a set $\mathcal{M}$ of such tuples, let $B(\mathcal{M})$ be the union
of the sets $B(c)$ for all $c \in \mathcal{M}$. Note that if all candidate
correspondences $c \in \mathcal{M}$ correspond to atomic patches,
then $B(\mathcal{M}) = \mathcal{M}$.

Thus, the algorithm for backtracking correspondences
is the following. Consider an entry match
$\mathcal{M} = \{(N, p, p', C_{N,p}(p'))\}$. We repeatedly apply $B$ on $\mathcal{M}$.
After $n = \log_2(N/4)$ calls, we get one correspondence for each of
the $4^n$ atomic patches. Furthermore, their score is equal
to the sum of all patch similarities along their
backtracking path.
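
The backtracking procedure can be sketched in Python as below. Several points are assumptions of this example rather than details confirmed by the text: the pyramid is stored as a dictionary mapping `(N, p)` to a 2-D list, the max-pooling is undone over the 3x3 domain $\{-1,0,1\}^2$, atomic tuples keep their size field so that $B$ can be applied repeatedly, and the toy maps and scores are hypothetical.

```python
# Offsets o_0 .. o_3 from Eq. (5).
OFFSETS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def backtrack_once(pyramid, match):
    """One application of B to a tuple (N, p, p', s), following Eqs. (13)-(14)."""
    n, p, q, s = match
    if n == 4:
        return [match]  # atomic patches are left unchanged
    out = []
    for ox, oy in OFFSETS:
        child_p = (p[0] + (n // 4) * ox, p[1] + (n // 4) * oy)
        cmap = pyramid[(n // 2, child_p)]
        h, w = len(cmap), len(cmap[0])
        bx, by = 2 * (q[0] + ox), 2 * (q[1] + oy)
        # Undo max-pooling: best position in the 3x3 neighborhood (assumed domain).
        candidates = [(bx + mx, by + my)
                      for my in (-1, 0, 1) for mx in (-1, 0, 1)
                      if 0 <= bx + mx < w and 0 <= by + my < h]
        if not candidates:
            continue  # this child falls outside the map
        child_q = max(candidates, key=lambda c: cmap[c[1]][c[0]])
        out.append((n // 2, child_p, child_q, s + cmap[child_q[1]][child_q[0]]))
    return out


def backtrack(pyramid, entry):
    """Apply B repeatedly until only atomic (N = 4) correspondences remain."""
    matches = [entry]
    while any(m[0] > 4 for m in matches):
        matches = [c for m in matches for c in backtrack_once(pyramid, m)]
    return matches


def peaked(px, py, value, size=8):
    """Toy correlation map with a single peak (hypothetical data)."""
    return [[value if (x, y) == (px, py) else 0.0 for x in range(size)]
            for y in range(size)]


pyramid = {
    (4, (2, 2)): peaked(3, 2, 0.9),
    (4, (2, 6)): peaked(2, 6, 0.8),
    (4, (6, 2)): peaked(6, 1, 0.7),
    (4, (6, 6)): peaked(7, 7, 0.6),
}
# Entry match: 8x8 patch centered at (4, 4), matched at p' = (2, 2) with score 0.5.
atomic = backtrack(pyramid, (8, (4, 4), (2, 2), 0.5))
```

Each of the four atomic patches ends up at a slightly different displacement, and its score is the entry score plus the correlation found along its path.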


Merging correspondences. We have shown how to
retrieve atomic correspondences from a match between
two deformable (potentially large) patches. Despite this
flexibility, a single match is unlikely to explain the
complex set of motions that can occur, for example, between
two adjacent frames in a video, i.e., two objects
moving independently with significantly different motions
exceed the deformation range of DeepMatching. We
quantitatively specify this range in the next subsection.

Fig. 8 Matching result between two images with repetitive
textures. Nearly all output correspondences are correct. Wrong
matches are due to occluded areas (bottom-right of the first
image) or situations where the deformation tolerance of
DeepMatching is exceeded (bottom-left of the first image).

We thus merge atomic correspondences gathered from
different entry points (matches) in the pyramid. In the
initial version of DeepMatching (Weinzaepfel et al 2013),
entry points were local maxima over all correlation maps.
This is now replaced by a faster procedure, that starts
with all possible matches in the top pyramid level (i.e.
$\mathcal{M} = \{(N, p, p', C_{N,p}(p')) \mid N = N_{max}\}$). Using this
level only results in significantly fewer entry points than
starting from all maxima in the entire pyramid. We did
not observe any impact on the matching performance,
see Section 5.2.2. Because $\mathcal{M}$ contains a lot of
overlapping patches, most of the computation during repeated
calls to $\mathcal{M} \leftarrow B(\mathcal{M})$ can be factorized. In other words,
as soon as two tuples in $\mathcal{M}$ are equal in terms of $N$, $p$
and $p'$, the one with the lowest score is simply
eliminated. We thus obtain a set of atomic correspondences
$\mathcal{M}'$:

$$ \mathcal{M}' = (B \circ \ldots \circ B)(\mathcal{M}) \tag{15} $$

that we filter with reciprocal match verification. The
final set of correspondences $\mathcal{M}''$ is obtained as:

$$ \mathcal{M}'' = \left\{ (p, p', s) \mid \mathrm{BestAt}(p) = \mathrm{BestAt}'(p') \right\}_{(p, p', s) \in \mathcal{M}'}, \tag{16} $$

where BestAt(p) (resp. BestAt'(p')) returns the best
match in a small vicinity of $4 \times 4$ pixels around $p$ in $I$
(resp. around $p'$ in $I'$) from $\mathcal{M}'$.


3.4 Discussion and Analysis of DeepMatching 

Multi-size patches and repetitive textures. During the 
bottom-up pass of the algorithm, we iteratively aggre¬ 
gate correlation maps of smaller patches to form the 
correlation maps of larger patches. Doing so, we effec¬ 
tively consider patches of different sizes (4 x 2^,^ > 0), 
in contrast to most existing matching methods. This 
is a key feature of our approach when dealing with 
repetitive textures. As one moves up to upper levels, 
the matching problem gets less ambiguous. Hence, our 
method can correctly match repetitive patterns, see for 
instance Figure 8. 

Quasi-dense correspondences. Our method retrieves dense 
correspondences for every single match between large 
regions {i.e. entry point for the backtracking in the top- 
level correlation maps), even in weakly textured areas; 
this is in contrast to correspondences obtained when 
matching descriptors {e.g. SIFT). A quantitative as¬ 
sessment, which compares the coverage of matches ob¬ 
tained with several matching schemes, is given in Sec¬ 
tion 5. 

Non-rigid deformations. Our matching algorithm is able 
to cope with various sources of image deformations: 
object-induced or camera-induced. The set of feasible 
deformations, explicitly defined by Eq. (6), theoretically 


Ad' = (So...oS)(Ad) 
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correspondences 
of the quadrants 


Fig. 7 Backtracking atomic correspondences from an entry point (red dot) in the top pyramid level (left). At each level, the backtracking consists in undoing the aggregation performed previously in order to recover the position of the four children patches in the lower level. When the bottom level is reached, we obtain a set of correspondences for atomic patches (right). [Panels: first image, second image.]



Fig. 9 Extent of the tolerance of DeepMatching to deformations. 
From left to right: up-scale of 1.5x, down-scale of 0.5x, rotation 
of 26°. The plain gray (resp. dashed red) square represents the 
patch in the reference (resp. target) image. For clarity, only the 
corner pixels are maximally deformed. 


Fig. 10 Histogram over smoothness for identity warping, warping respecting the built-in constraints in DeepMatching and random warping. The x-axis indicates the smoothness value. The smoothness value is low when there are few discontinuities, i.e., the warpings are smooth. The histogram is obtained with 10,000 different artificial warpings. See text for details. [Legend: identity (null) warping; sampled from the set of feasible warpings W; random warpings over the same region. The y-axis ranges up to 10000.]





allows for a scaling factor in the range $[\tfrac{1}{2}, \tfrac{3}{2}]$ and rotations approximately in the range $[-26°, 26°]$. Note also that DeepMatching is translation-invariant by construction, thanks to the convolutional nature of the processing.

Proof Given a patch of size $N = 4 \times 2^\ell$ located at level $\ell \geq 1$, Eq. (6) allows each of its children patches to move by at most $N/8$ pixels from their ideal location in $\Omega_i$. By recursively summing the displacements at each level, the maximal displacement for an atomic patch is $d_N = \sum_{i=1}^{\ell} 2^{i-1} = 2^\ell - 1$. An example is given in Figure 9 with $N = 32$ and $\ell = 3$. Relatively to $N$, we thus have $\lim_{N \to \infty} (N + 2 d_N)/N = \tfrac{3}{2}$ and $\lim_{N \to \infty} (N - 2 d_N)/N = \tfrac{1}{2}$. For a rotation, the rationale is similar, see Figure 9. □
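The bounds in the proof can be checked numerically with a short Python sketch. The scale bounds follow the proof directly. The rotation bound is an assumption on our part: it takes the angle subtended by a displacement of $d_N$ at a distance of $N/2$ from the patch center, which is one reading of Figure 9 and reproduces the quoted 26° in the limit.

```python
import math


def max_displacement(level: int) -> int:
    """d_N for a patch of size N = 4 * 2**level: sum of N_i / 8 over levels 1..level."""
    return sum((4 * 2 ** i) // 8 for i in range(1, level + 1))


def scale_bounds(level: int):
    """Smallest and largest scale factor reachable at this level."""
    N = 4 * 2 ** level
    d = max_displacement(level)
    return (N - 2 * d) / N, (N + 2 * d) / N


def rotation_bound_deg(level: int) -> float:
    """Assumed rotation tolerance: displacement d_N seen from half the patch size."""
    N = 4 * 2 ** level
    return math.degrees(math.atan(max_displacement(level) / (N / 2)))
```

For `level = 3` (so $N = 32$) this gives $d_N = 7$, in agreement with $2^\ell - 1$, and the bounds approach 1/2, 3/2 and $\arctan(1/2)$ as the level grows.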

Note that the displacement tolerance in $\Omega_i$ from Eq. (6) could be extended to $x \times N/8$ pixels with $x \in \{2, 3, \ldots\}$ (instead of $x = 1$). Then the above formula for computing the lower bound on the scale factor of DeepMatching generalizes to $LB(x) = \lim_{N \to \infty} (N - 2 x d_N)/N$. Hence, for $x \geq 2$ we obtain $LB(x) \leq 0$ instead of $LB(1) = \tfrac{1}{2}$. This implies that the deformation range is extended to a point where any patch can be matched to a single pixel, i.e., this results in unrealistic deformations. For this reason, we choose not to expand the deformation range of DeepMatching.

Built-in smoothing. Furthermore, correspondences generated through backtracking of a single entry point in the correlation maps are naturally smooth. Indeed, feasible deformations cannot be too "far" from the identity deformation. To verify this assumption, we conduct the following experiment. We artificially generate two types of correspondences between two images of size $128 \times 128$. The first one is completely random, i.e. for each atomic patch in the first image we randomly assign a match in the second image. The second one respects the backtracking constraints. Starting from a single entry point in the top level, we simulate the backtracking procedure from Section 3.3 by replacing in Eq. (13) the max operation by a random sampling over $\{-1, 0, 1\}^2$. By generating 10,000 sets of possible atomic correspondences, we simulate a set which respects the deformations allowed by DeepMatching. Figure 10 compares the smoothness of these two types of artificial correspondences. Smoothness is measured by interpreting the correspondences as flow and measuring the gradient flow norm, see Eq. (19). Clearly, the two types of warpings differ by orders of magnitude. Furthermore, the one which respects the built-in constraints of DeepMatching is close to the identity warping.

Relation to Deep Convolutional Neural Networks (CNNs). DeepMatching relies on a hierarchical, multi-layer, correlational architecture designed for matching images and was inspired by deep convolutional approaches (LeCun et al 1998a). In the following we describe the major similarities and differences.

Deep networks learn from data the weights of the 
convolutions. In contrast, DeepMatching does not learn 
any feature representations and instead directly com¬ 
putes correlations at the patch level. It uses patches 
from the first image as convolution filters for the second 
one. However, the bottom-up pipeline of DeepMatching 
is similar to CNNs. It alternates aggregating channels 
from the previous layer with channel-wise max-pooling 
and subsampling. As in CNNs, max-pooling in Deep- 
Matching allows for invariance w.r.t. small deforma¬ 
tions. Likewise, the algorithm propagates pairwise patch 
similarity scores through the hierarchy using non-linear 
rectifying stages in-between layers. Finally, DeepMatching 
includes a top-down pass which is not present in CNNs. 


Time and space complexity. DeepMatching has a complexity $O(LL')$ in memory and time, where $L = WH$ and $L' = W'H'$ are the number of pixels per image.

Proof Computing the initial correlations is an $O(LL')$ operation. Then, at each level of the pyramid, the process is repeated while the complexity is divided by a factor 4 due to the subsampling step in the target image (since the cardinality of $|\{C_{N,p}\}|$ remains approximately constant). Thus, the total complexity of the correlation maps computation is, at worst, $O\big(\sum_{n=0}^{\infty} LL'/4^n\big) = O(LL')$. During the top-down pass, most backtracking paths can be pruned as soon as they cross a concurrent path with a higher score (see Section 3.3). Thus, all correlations will be examined at most once, and there are $\sum_{n=0}^{\infty} LL'/4^n$ values in total. However, this analysis is worst-case. In practice, only correlations lying on maximal paths are actually examined. □
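For completeness, the bound used in the proof is a plain geometric series with ratio 1/4. The worked step below only spells this out; the constant 4/3 is not stated in the text but follows directly.

```latex
\sum_{n=0}^{\infty} \frac{LL'}{4^{n}}
  \;=\; LL' \cdot \frac{1}{1 - \tfrac{1}{4}}
  \;=\; \frac{4}{3}\, LL'
  \;=\; O(LL').
```

In other words, under this worst-case analysis the whole pyramid costs at most one third more than the bottom level alone.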


4 Extensions of DeepMatching 

4.1 Approximate DeepMatching 

As a consequence of its $O(LL')$ space complexity, DeepMatching requires an amount of RAM that is orders of magnitude above other state-of-the-art matching methods. This could correspond to several gigabytes for images of moderate size ($800 \times 600$ pixels); see Section 5.2.3. This section introduces an approximation of DeepMatching that allows trading matching quality for reduced time and memory usage. As shown in Section 5.2.3, near-optimal results can be obtained at a fraction of the original cost.

Our approximation proposes to compress the representation of atomic patches. Atomic patches carry little information, and thus are highly redundant. For instance, in uniform regions, all patches are nearly identical (i.e., gradient-wise). To exploit this property, we index atomic patches with a small set of patch prototypes. We substitute each patch with its closest neighbor in a fixed dictionary of $D$ prototypes. Hence, we need to perform and store only $D$ convolutions at the first level, instead of $O(L)$ (with $D \ll O(L)$). This significantly reduces both memory and time complexity. Note that higher pyramid levels also benefit from this optimization. Indeed, two parent patches at the second level have the exact same correlation map in case their children are assigned the same prototypes. The same reasoning also holds for all subsequent levels, but the gains rapidly diminish because the required condition becomes statistically unlikely. This is not really an issue, since the memory and computational cost mostly rests on the initial levels; see Section 3.4.

In practice, we build the prototype dictionary using k-means, as it is designed to minimize the approximation error between input descriptors and resulting centroids (i.e. prototypes). Given a pair of images to match, we perform on-line clustering of all descriptors of atomic patches $\{I_{4,p}\} = \{R\}$ in the first image. Since the original descriptors lie on a hypersphere (each pixel descriptor $R_{i,j}$ has norm 1), we modify the k-means approach so as to project the estimated centroids on the hypersphere at each iteration. We find experimentally that this is important to obtain good results.
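The modified k-means can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: descriptors are given as unit-norm lists of floats, the initialization (random sampling of `D` descriptors) and the fixed iteration count are our choices and are not specified in the text, and the function name `spherical_kmeans` is hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Project a vector onto the unit hypersphere (left unchanged if it is zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def assign_all(descs: List[Vector], centroids: List[Vector]) -> List[int]:
    # For unit vectors, the largest dot product is the smallest Euclidean distance.
    return [max(range(len(centroids)), key=lambda k: dot(d, centroids[k]))
            for d in descs]


def spherical_kmeans(descs: List[Vector], D: int, iters: int = 10,
                     seed: int = 0) -> Tuple[List[Vector], List[int]]:
    """k-means whose centroids are re-projected on the hypersphere each iteration."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(descs, D)]  # assumed initialization
    for _ in range(iters):
        labels = assign_all(descs, centroids)
        for k in range(D):
            members = [d for d, a in zip(descs, labels) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[k] = normalize(mean)  # projection step
    return centroids, assign_all(descs, centroids)
```

Each atomic patch would then be replaced by the prototype index in the returned assignment, so that only `D` correlation maps need to be computed at the first level.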


4.2 Scale and rotation invariant DeepMatching 

For a variety of tasks, objects to be matched can appear 
under image rotations or at different scales (Lowe 2004; 
Mikolajczyk et al 2005; Szeliski 2010; HaCohen et al 
2011). As discussed above, DeepMatching (DM) is only robust to moderate scale changes and rotations. We now present a scale and rotation invariant version.

The approach is straightforward: we apply DM to several rotated and scaled versions of the second image. According to the invariance range of DM, we use steps of $\pi/4$ for image rotation and powers of $\sqrt{2}$ for scale changes. While iterating over all combinations of scale changes and rotations, we maintain a list $\mathcal{M}'$ of all atomic correspondences obtained so far, i.e. corresponding positions and scores. As before, the final output correspondences consist of the reciprocal matches in $\mathcal{M}'$. Storing all matches and finally choosing the best ones based on reciprocal verification makes it possible to capture distinct motions possibly occurring together in the same scene (e.g. one object could have undergone a rotation, while the rest of the scene did not move). The steps of the approach are described in Algorithm 2.

Since we iterate sequentially over a fixed list of rotations and scale changes, the space and time complexity of the algorithm remains unchanged (i.e. $O(LL')$). In practice, the run-time compared to DM is multiplied by a constant approximately equal to 25, see Section 5.2.4. Note that the algorithm permits a straightforward parallelization.


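Algorithm 2 Scale and rotation invariant version of DeepMatching (DM). $I_\sigma$ denotes the image $I$ downsized by a factor $\sigma$, and $R_\theta$ denotes rotation by an angle $\theta$.
Input: $I$, $I'$ are the images to be matched
Initialize an empty set $\mathcal{M}' = \{\}$ of correspondences
For $\sigma \in \{-2, -1.5, \ldots, 1.5, 2\}$ do
    $\sigma_1 \leftarrow \max(1, 2^{+\sigma})$    # either downsize image 1
    $\sigma_2 \leftarrow \max(1, 2^{-\sigma})$    # or downsize image 2
    For $\theta \in \{0, \pi/4, \ldots, 7\pi/4\}$ do
        # get raw atomic correspondences (Eq. (15))
        $\mathcal{M}'_{\sigma,\theta} \leftarrow \mathrm{DeepMatching}(I_{\sigma_1}, R_{-\theta} * I'_{\sigma_2})$
        # Geometric rectification to the input image space:
        $\mathcal{M}'^{R}_{\sigma,\theta} \leftarrow \{(\sigma_1 p,\ \sigma_2 R_\theta p',\ s) \mid \forall (p, p', s) \in \mathcal{M}'_{\sigma,\theta}\}$
        # Concatenate results:
        $\mathcal{M}' \leftarrow \mathcal{M}' \cup \mathcal{M}'^{R}_{\sigma,\theta}$
$\mathcal{M}'' \leftarrow \mathrm{reciprocal}(\mathcal{M}')$    # keep reciprocal matches (Eq. (16))
Return $\mathcal{M}''$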
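Two pieces of Algorithm 2 are easy to illustrate in Python: the list of downsizing factors and the final reciprocal filtering. The sketch below is not the authors' code. In particular, the "small vicinity of 4 x 4 pixels" of Eq. (16) is approximated here by binning positions into fixed 4 x 4 cells, which is a simplifying assumption, and the helper names `scale_pairs` and `reciprocal` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]
Corr = Tuple[Point, Point, float]  # (p, p', s)


def scale_pairs() -> List[Tuple[float, float]]:
    """Downsizing factors (sigma_1, sigma_2) for sigma in {-2, -1.5, ..., 2}."""
    sigmas = [-2.0 + 0.5 * i for i in range(9)]
    return [(max(1.0, 2.0 ** s), max(1.0, 2.0 ** -s)) for s in sigmas]


def reciprocal(matches: List[Corr], cell: int = 4) -> List[Corr]:
    """Keep a match only if it is the best one in its cell in both images."""
    best1: Dict[Point, Corr] = {}
    best2: Dict[Point, Corr] = {}

    def key(pt: Point) -> Point:
        return (pt[0] // cell, pt[1] // cell)  # assumed 4x4 binning

    for m in matches:
        p, q, s = m
        if key(p) not in best1 or s > best1[key(p)][2]:
            best1[key(p)] = m
        if key(q) not in best2 or s > best2[key(q)][2]:
            best2[key(q)] = m
    return [m for m in matches
            if best1[key(m[0])] is m and best2[key(m[1])] is m]
```

Because every rotation and scale combination simply appends to the same list before `reciprocal` is called, the outer loops can run independently, which is consistent with the remark that the algorithm parallelizes easily.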


4.3 DeepFlow 

We now present our approach for optical flow estimation, DeepFlow. We adopt the method introduced by Brox and Malik (2011), where a matching term penalizes the differences between optical flow and input matches, and replace their matching approach by DeepMatching. In addition, we make a few minor modifications introduced recently in the state of the art: (i) we add a normalization in the data term to downweight the impact of locations with high spatial image derivatives (Zimmer et al 2011); (ii) we use a different weight at each level to downweight the matching term at finer scales (Stoll et al 2012); and (iii) the smoothness term is locally weighted (Xu et al 2012).

Let $I_1, I_2 : \Omega \to \mathbb{R}^c$ be two consecutive images defined on $\Omega$ with $c$ channels. The goal is to estimate the flow $w = (u, v)^\top : \Omega \to \mathbb{R}^2$. We assume that the images are already smoothed using a Gaussian filter of standard deviation $\sigma$. The energy we optimize is a weighted sum of a data term $E_D$, a smoothness term $E_S$ and a matching term $E_M$:

$$E(w) = \int_\Omega E_D + \alpha E_S + \beta E_M \, dx \quad (17)$$

For the three terms, we use a robust penalizer $\Psi(s^2) = \sqrt{s^2 + \epsilon^2}$ with $\epsilon = 0.001$ which has shown excellent results (Sun et al 2014b).
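As a small illustration, the robust penalizer and its derivative can be written as follows. The derivative is not given in the text; it is included here because variational solvers of this kind commonly use it as the non-linear weight in their fixed point iterations, which is an assumption about the solver rather than a statement from the paper.

```python
import math


def psi(s2: float, eps: float = 0.001) -> float:
    """Robust penalizer Psi(s^2) = sqrt(s^2 + eps^2); behaves like |s| away from zero."""
    return math.sqrt(s2 + eps * eps)


def psi_prime(s2: float, eps: float = 0.001) -> float:
    """Derivative of Psi with respect to s^2; finite at zero thanks to eps."""
    return 1.0 / (2.0 * math.sqrt(s2 + eps * eps))
```

For large residuals `psi(s2)` grows like the absolute value of the residual rather than its square, which is what makes the three terms tolerant to outliers.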

Data term. The data term is a separate penalization of the color and gradient constancy assumptions with a normalization factor as proposed by Zimmer et al (2011). We start from the optical flow constraint assuming brightness constancy: $(\nabla_3^\top I)\, w = 0$ with $\nabla_3 = (\partial_x, \partial_y, \partial_t)^\top$ the spatio-temporal gradient. A basic way to build a data term is to penalize it, i.e. $E_D = \Psi(w^\top J_0 w)$ with $J_0$ the tensor defined by $J_0 = (\nabla_3 I)(\nabla_3^\top I)$. As highlighted by Zimmer et al (2011), such a data term adds a higher weight in locations corresponding to high spatial image derivatives. We normalize it by the norm of the spatial derivatives plus a small factor to avoid division by zero, and to slightly reduce the influence of locations with tiny gradients (Zimmer et al 2011). Let $\bar{J}_0$ be the normalized tensor $\bar{J}_0 = \theta_0 J_0$ with $\theta_0 = (\|\nabla_2 I\|^2 + \zeta^2)^{-1}$. We set $\zeta = 0.1$ in the following. To deal with color images, we consider the tensor defined for a channel $i$ denoted by upper indices $\bar{J}_0^i$ and we penalize the sum over channels: $\Psi\big(\sum_{i=1}^{c} w^\top \bar{J}_0^i w\big)$. We consider images in the RGB color space.

We separately penalize the gradient constancy assumption (Bruhn et al 2005). Let $I_x$ and $I_y$ be the derivatives of the images with respect to the $x$ and $y$ axis respectively. Let $\bar{J}_{xy}^i$ be the tensor for the channel $i$ including the normalization

$$\bar{J}_{xy}^i = \frac{(\nabla_3 I_x^i)(\nabla_3^\top I_x^i)}{\|\nabla_2 I_x^i\|^2 + \zeta^2} + \frac{(\nabla_3 I_y^i)(\nabla_3^\top I_y^i)}{\|\nabla_2 I_y^i\|^2 + \zeta^2}.$$

The data term is the sum of two terms, balanced by two weights $\delta$ and $\gamma$:

$$E_D = \delta\, \Psi\Big(\sum_{i=1}^{c} w^\top \bar{J}_0^i w\Big) + \gamma\, \Psi\Big(\sum_{i=1}^{c} w^\top \bar{J}_{xy}^i w\Big) \quad (18)$$

Smoothness term. The smoothness term is a robust penalization of the gradient flow norm:

$$E_S = \Psi\big(\|\nabla u\|^2 + \|\nabla v\|^2\big). \quad (19)$$

The smoothness weight $\alpha$ is locally set according to image derivatives (Wedel et al 2009; Xu et al 2012) with $\alpha(x) = \exp(-\kappa \nabla_2 I(x))$ where $\kappa$ is experimentally set to $\kappa = 5$.

Matching term. The matching term encourages the flow estimation to be similar to a precomputed vector field $w'$. To this end, we penalize the difference between $w$ and $w'$ using the robust penalizer $\Psi$. Since the matching is not totally dense, we add a binary term $c(x)$ which is equal to 1 if and only if a match is available at $x$.

We also multiply each matching penalization by a weight $\phi(x)$, which is low in uniform regions where matching is ambiguous and when matched patches are dissimilar. To that aim, we rely on $\tilde{\lambda}(x)$, the minimum eigenvalue of the autocorrelation matrix multiplied by 10. We also compute the visual similarity between matches as $\Delta(x) = \sum_{i=1}^{c} |\nabla I_1^i(x) - \nabla I_2^i(x - w'(x))|$. We then compute the score $\phi$ as a Gaussian kernel on $\Delta$ weighted by $\tilde{\lambda}$ with a parameter $\sigma_M$, experimentally set to $\sigma_M = 50$. More precisely, we define $\phi(x)$ at each point $x$ with a match $w'(x)$ as:

$$\phi(x) = \frac{\sqrt{\tilde{\lambda}(x)}}{\sigma_M \sqrt{2\pi}} \exp\big(-\Delta(x) / 2\sigma_M\big).$$

The matching term is then $E_M = c\, \phi\, \Psi(\|w - w'\|^2)$.

Minimization. This energy objective is non-convex and non-linear. To solve it, we use a numerical optimization algorithm similar to that of Brox et al (2004). An incremental coarse-to-fine warping strategy is used with a downsampling factor $\eta = 0.95$. The remaining equations are still non-linear due to the robust penalizers. We apply 5 inner fixed point iterations where the non-linear weights and the flow increments are iteratively updated while fixing the other. To approximate the solution of the linear system, we use 25 iterations of the Successive Over Relaxation (SOR) method (Young and Rheinboldt 1971).
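For readers unfamiliar with SOR, a generic dense-matrix version is sketched below. It is a textbook formulation, not the solver used for DeepFlow: the real linear system is sparse and defined per pixel, and the relaxation factor `omega = 1.5` is an arbitrary illustrative default that the text does not specify.

```python
from typing import List, Optional


def sor_solve(A: List[List[float]], b: List[float], omega: float = 1.5,
              iters: int = 25, x0: Optional[List[float]] = None) -> List[float]:
    """Approximate the solution of A x = b with a fixed number of SOR sweeps."""
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n
    for _ in range(iters):
        for i in range(n):
            # Use the most recent values of the other unknowns (Gauss-Seidel style).
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            gauss_seidel = (b[i] - sigma) / A[i][i]
            # Over-relax: blend the previous value with the Gauss-Seidel update.
            x[i] = (1.0 - omega) * x[i] + omega * gauss_seidel
    return x
```

A fixed sweep count, as in the text, yields an approximate solution, which is typically sufficient because the system is rebuilt at every fixed point iteration anyway.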

To downweight the matching term on fine scales, we use a different weight $\beta$ at each level as proposed by Stoll et al (2012). We set $\beta(k) = \beta (k / k_{\max})^b$ where $k$ is the current level of computation, $k_{\max}$ the coarsest level and $b$ a parameter which is optimized together with the other parameters, see Section 5.3.1.
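The level-dependent weight is a one-line formula. The sketch below only evaluates it; the numeric values used in the usage note are placeholders chosen for illustration, not the optimized parameters of Section 5.3.1.

```python
def matching_weight(k: int, k_max: int, beta: float, b: float) -> float:
    """beta(k) = beta * (k / k_max)**b: full weight at the coarsest level k_max."""
    return beta * (k / k_max) ** b
```

With placeholder values `beta = 300.0` and `b = 0.6`, the weight equals 300 at `k = k_max` and decreases monotonically to 0 at the finest level `k = 0`, so the matches guide the coarse levels while the data and smoothness terms dominate at fine scales.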


5 Experiments 

This section presents an experimental evaluation of Deep- 
Matching and DeepFlow. The datasets and metrics used 
to evaluate DeepMatching and DeepFlow are introduced 
in Section 5.1. Experimental results are given in Sec¬ 
tions 5.2 and 5.3 respectively. 


5.1 Datasets and metrics 

In this section we briefly introduce the matching and 
flow datasets used in our experiments. Since consecutive 
frames of a video are well-suited to evaluate a match¬ 
ing approach, we use several optical flow datasets for 
evaluating both the quality of matching and flow, but 
we rely on different metrics. 

The Mikolajczyk dataset was originally proposed by 
Mikolajczyk et al (2005) to evaluate and compare the 
performance of keypoint detectors and descriptors. It is 
one of the standard benchmarks for evaluating match¬ 
ing approaches. The dataset consists of 8 sequences of 
6 images each viewing a scene under different condi¬ 
tions, such as illumination changes or viewpoint changes. 
The images of a sequence are related by homographies. 
During the evaluation, we comply with the standard procedure in which the first image of each scene is matched to the 5 remaining ones. Since our goal is to study robustness of DeepMatching to geometric distortions, we follow HaCohen et al (2011) and restrict our evaluation to the 4 most difficult sequences with viewpoint changes: bark, boat, graf and wall.

The MPI-Sintel dataset (Butler et al 2012) is a challenging evaluation benchmark for optical flow estimation, constructed from realistic computer-animated films.
The dataset contains sequences with large motions and
specular reflections. In the training set, more than 17.5% 
of the pixels have a motion over 20 pixels, approxi¬ 
mately 10% over 40 pixels. We use the “final” version, 
featuring rendering effects such as motion blur, defocus 
blur and atmospheric effects. Note that ground-truth 
optical flows for the test set are not publicly available. 

The Middlebury dataset (Baker et al 2011) has been 
extensively used for evaluating optical flow methods. 
The dataset contains complex motions, but most of the 
motions are small. Less than 3% of the pixels have a 
motion over 20 pixels, and no motion exceeds 25 pixels 
(training set). Ground-truth optical flows for the test
set are not publicly available.

The Kitti dataset (Geiger et al 2013) contains real-world sequences taken from a driving platform. The
dataset includes non-Lambertian surfaces, different light¬ 
ing conditions, a large variety of materials and large 
displacements. More than 16% of the pixels have mo¬ 
tion over 20 pixels. Again, ground-truth optical flows 
for the test set are not publicly available. 

Performance metric for matching. Choosing a performance measure for matching approaches is delicate.
Matching approaches typically do not return dense cor¬ 
respondences, but output varying numbers of matches. 
Furthermore, correspondences might be concentrated in 
different areas of the image. 

Most matching approaches, including DeepMatching, 
are based on establishing correspondences between patches. 
Given a pair of matching patches, it is possible to ob¬ 
tain a list of pixel correspondences for all pixels within 
the patches. We introduce a measure based on the num¬ 
ber of correctly matched pixels compared to the over¬ 
all number of pixels. We define “accuracy@T” as the 
proportion of “correct” pixels from the first image with 
respect to the total number of pixels. A pixel is con¬ 
sidered correct if its pixel match in the second image 
is closer than T pixels to ground-truth. In practice, we 
use a threshold of T = 10 pixels, as this represents a 
sufficiently precise estimation (about 1% of image di¬ 
agonal for all datasets), while allowing some tolerance 
in blurred areas that are difficult to match exactly. If a 
pixel belongs to several matches, we choose the one with 
the highest score to predict its correspondence. Pixels 
which do not belong to any patch have an infinite error. 
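The accuracy@T measure can be sketched as follows. The data layout is an assumption made for illustration: `gt` maps every evaluated pixel of the first image to its ground-truth position in the second image, and `pred` maps a pixel to its predicted position after the highest-scoring match has already been selected. Pixels missing from `pred` are treated as having infinite error.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def accuracy_at_T(pred: Dict[Point, Point], gt: Dict[Point, Point],
                  T: float = 10.0) -> float:
    """Fraction of ground-truth pixels whose predicted match is closer than T pixels."""
    correct = 0
    for pixel, target in gt.items():
        match = pred.get(pixel)
        if match is None:
            continue  # pixel not covered by any patch: infinite error
        if math.hypot(match[0] - target[0], match[1] - target[1]) < T:
            correct += 1
    return correct / len(gt)
```

Dividing by the total number of pixels rather than by the number of matched pixels is what lets the measure reward coverage as well as precision.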

Performance metric for optical flow. To evaluate optical flow, we follow the standard protocol and measure the average endpoint error over all pixels, denoted as "EPE". The "s10-40" variant measures the EPE only for pixels with a ground-truth displacement between 10 and 40 pixels, and likewise for "s0-10" and "s40+".

In all cases, scores are averaged over all image pairs to yield the final result for a given dataset.
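The EPE and its displacement-restricted variants can be sketched as below. Flows are represented as flat lists of `(u, v)` tuples for simplicity, and the interval convention (lower bound inclusive, upper bound exclusive) is our assumption, since the text does not state how the boundaries at 10 and 40 pixels are handled.

```python
import math
from typing import List, Tuple

Flow = List[Tuple[float, float]]


def epe(flow: Flow, gt: Flow, lo: float = 0.0, hi: float = float("inf")) -> float:
    """Average endpoint error over pixels whose ground-truth motion lies in [lo, hi)."""
    errors = []
    for (u, v), (gu, gv) in zip(flow, gt):
        if lo <= math.hypot(gu, gv) < hi:
            errors.append(math.hypot(u - gu, v - gv))
    return sum(errors) / len(errors) if errors else float("nan")
```

Calling `epe(flow, gt, 10, 40)` would correspond to the "s10-40" variant, and `epe(flow, gt, 40)` to "s40+".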


5.2 Matching Experiments 

In this section, we evaluate DeepMatching (DM). We present results for all datasets presented above but Middlebury, which does not feature long-range motions, the main difficulty in image matching. When evaluating on the Mikolajczyk dataset, we employ the scale and rotation invariant version of DM presented in Section 4.2. For all the matching experiments reported in this section, we use the Mikolajczyk dataset and the training sets of MPI-Sintel and Kitti.


Fig. 11 Impact of the parameters to compute pixel descriptors on the different datasets. [Panels: Mikolajczyk dataset; MPI-Sintel (final).]


5.2.1 Impact of the parameters

We optimize the different parameters of DM jointly on 
all datasets. To prevent overfitting, we use the same 
parameters across all datasets. 


Pixel descriptor parameters: We first optimize the parameters of the pixel representation (Section 3.2): $\nu_1$, $\nu_2$, $\nu_3$ (different smoothing stages), $\varsigma$ (sigmoid slope) and $\mu$ (regularization constant). After performing a grid search, we find that good results are obtained at $\nu_1 = \nu_2 = \nu_3 = \varsigma = 0.2$ and $\mu = 0.3$ across all datasets. Figure 11 shows the accuracy@10 in the neighborhood of these values for all parameters. Image pre-smoothing seems to be crucial for JPEG images (Mikolajczyk dataset), as it smooths out compression artifacts, whereas it slightly degrades performance for uncompressed PNG images (MPI-Sintel and Kitti). As expected, similar findings are observed for the regularization constant $\mu$ since it acts as a regularizer that reduces the impact of small gradients (i.e. noise). In the following, we thus use low values of $\nu_1$ and $\mu$ when dealing with PNG images (we set $\nu_1 = 0$ and $\mu = 0.1$, other parameters are unchanged).


Non-linear rectification: We also evaluate the impact of the parameter $\lambda$ of the non-linear rectification obtained by applying power normalization, see Eq. (11). Figure 12 displays the accuracy@10 for various values of $\lambda$. We can observe that the optimal performance is achieved at $\lambda = 1.4$ for all datasets. We use this value in the remainder of our experiments.

Fig. 12 Impact of the non-linear response rectification (Eq. (11)).


R     New BT entry points   New scoring scheme   accuracy@10   memory usage   matching time

Mikolajczyk dataset
1/4   -                     -                    0.620         0.9 GB         1.0 min
1/2   -                     -                    0.848         5.5 GB         20 min
1/2   -                     yes                  0.864         5.5 GB         7.3 min
1/2   yes                   yes                  0.878         4.4 GB         6.3 min

MPI-Sintel dataset (final)
1/4   -                     -                    0.822         0.4 GB         2.4 sec
1/2   -                     -                    0.880         6.3 GB         55 sec
1/2   -                     yes                  0.890         6.3 GB         16 sec
1/2   yes                   yes                  0.892         4.6 GB         16 sec

Kitti dataset
1/4   -                     -                    0.772         0.4 GB         2.0 sec
1/2   -                     -                    0.841         6.3 GB         39 sec
1/2   -                     yes                  0.855         6.3 GB         14 sec
1/2   yes                   yes                  0.856         4.7 GB         14 sec

Table 1 Detailed comparison between the preliminary and current versions of DeepMatching in terms of performance, run-time and memory usage. R denotes the input image resolution and BT backtracking. Run-times are computed on 1 core @ 3.6 GHz.

5.2.2 Evaluation of the backtracking and scoring 
schemes 

We now evaluate two improvements of DM with respect 
to the previous version published in Weinzaepfel et al 
(2013), referred to as DM*: 

— Backtracking (BT) entry points: in DM* we select as 
entry points local maxima in the correlation maps 
from all pyramid levels. The new alternative is to 
start from all possible points in the top pyramid 
level. 

— Scoring scheme: In DM* we scored atomic corre¬ 
spondences based on the correlation values of start 
and end point of the backtracking path. The new 
scoring scheme is the sum of correlation values along 
the full backtracking path. 

We report results for the different variants in Table 1 on each dataset. The first two rows for each dataset correspond to the exact settings used for DM* (i.e. with an image resolution of 1/4 and 1/2). We observe a steady increase in performance on all datasets when we add the new scoring and backtracking approach. We can observe that starting from all possible entry points in the top pyramid level (i.e. considering all possible translations) yields slightly better results than starting from local maxima. This demonstrates that some ground-truth matches are not covered by any local maximum. By enumerating all possible patch translations from the top level, we instead make sure that the space of all possible matches is fully explored.

Furthermore, it is interesting to note that memory usage and run-time significantly decrease when using the new options. This is because (1) searching and storing local maxima (which are exponentially more numerous in lower pyramid levels) is not necessary anymore, and (2) the new scoring scheme allows for further optimization, i.e. early pruning of backtracking paths (Section 3.3).

5.2.3 Approximate DeepMatching 

We now evaluate the performance of approximate Deep- 
Matching (Section 4.1) and report its run-time and 
memory usage. We evaluate and compare two differ¬ 
ent ways of reducing the computational load. The first 
one simply consists in downsizing the input images, and 
upscaling the resulting matches accordingly. The sec¬ 
ond option is the compression scheme proposed in Sec¬ 
tion 4.1. 

We evaluate both schemes jointly by varying the input image size (expressed as a fraction $R$ of the original resolution) and the size $D$ of the prototype dictionary (i.e. parameter of k-means in Section 4.1). $R = 1$ corresponds to the original dataset image size (no downsizing). We display the results in terms of matching accuracy (accuracy@10) against memory consumption in Figure 13 and as a function of $D$ in Figure 14. Figure 13 shows that DeepMatching can be computed in an approximate manner for any given memory budget. Unsurprisingly, too low settings (e.g. $R \leq 1/8$, $D \leq 64$) result in a strong loss of performance. It should be noted that we were unable to compute DeepMatching at full resolution ($R = 1$) for $D > 64$, as the memory consumption explodes. As a consequence, all subsequent experiments in the paper are done at $R = 1/2$. In Figure 14, we observe that good trade-offs are achieved for dictionary sizes in the range $D \in [64, 1024]$. For instance, on MPI-Sintel, at $D = 1024$, 94% of the performance of the uncompressed case ($D = \infty$) is reached for half the computation time and one third of the memory usage. Detailed timings of the different stages of DeepMatching are given in Table 2. As expected, only the bottom-up pass is affected by the approximation, with a run-time of the different operations involved (patch correlations, max-pooling, subsampling, aggregation and non-linear rectification) roughly proportional to $D$ (or to $|G_4|$, the actual number of atomic patches, if $D = \infty$).

The overhead of clustering the dictionary prototypes with k-means appears negligible, with the exception of the largest dictionary size ($D = 4096$) for which it induces a slightly longer run-time than in the uncompressed case. Overall, the proposed method for approximating DeepMatching is highly effective.

Fig. 13 Trade-off between memory consumption and matching performance for the different datasets. Memory usage is controlled by changing image resolution $R$ (different curves) and dictionary size $D$ (curve points).

[Panels: (a) Mikolajczyk dataset; (b) MPI-Sintel dataset (final version).]

Fig. 14 Performance, memory usage and run-time for different levels of compression corresponding to the size $D$ of the prototype dictionary (we set the image resolution to $R = 1/2$). A dictionary size $D = \infty$ stands for no compression. Run-times are for a single image pair on 1 core @ 3.6 GHz.


GPU Implementation. We have implemented DM on GPU in the Caffe framework (Jia et al 2014). Using existing Caffe layers like ConvolutionLayer and PoolingLayer, the implementation is straightforward for most layers. We had to specifically code a few layers which are not available in Caffe (e.g. the backtracking pass^). For the aggregation layer, which consists in selecting and averaging 4 children channels out of many channels, we relied on the sparse matrix multiplication in the cuSPARSE toolbox. Detailed timings are given in Table 2 on a GeForce Titan X. Our code runs in about 0.2s for a pair of MPI-Sintel images. As expected, the computation bottleneck essentially lies in the computation of bottom-level patch correlations and the backtracking pass. Note that computing patch descriptors takes significantly more time, in proportion, than on CPU: it takes about 0.024s = 11% of total time (not shown in table). This is because it involves a succession of many small layers (image smoothing, gradient extraction and projection, etc.), which causes overhead and is rather inefficient.


^ Although the backtracking is conceptually close to the back-propagation training algorithm, it differs in terms of how the scores are accumulated for each path.


Unit  R    D     Patch       Patch         Max-pooling    Aggregation  Non-linear     Backtracking  Total
                 clustering  correlations  + subsampling               rectification
CPU   1/2  64    0.3         0.2           0.4            0.9          0.8            5.1           7.7
CPU   1/2  1024  1.3         0.7           0.6            1.0          1.3            5.8           10.7
CPU   1/2  ∞     -           4.3           1.6            1.0          3.2            6.2           16.4
GPU   1/2  ∞     -           0.084         0.012          0.017        0.013          0.053         0.213

Table 2 Detailed timings of the different stages of DeepMatching,
measured for a single image pair from MPI-Sintel on
CPU (1 core @ 3.6 GHz) and GPU (GeForce Titan X) in seconds.
Stages are: patch clustering (only for approximate DM, see
Section 4.1), patch correlations (Eq. (4)), joint max-pooling and
subsampling, correlation map aggregation, non-linear rectification
(resp. S oV, and Rx in Eq. (12)), and correspondence
backtracking (Section 3.3). Other operations (e.g. reciprocal
verification of Eq. (16)) have negligible run-time. For operations
applied at several levels like the non-linear rectification, a
cumulative timing is given.
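
As a quick sanity check on the GPU row of Table 2, the share of each stage in the total run-time can be computed directly. The snippet below is only an illustrative calculation: the dictionary `gpu_stage_times` and the helper `share` are hypothetical names, and the 0.024 s figure for patch descriptors is the one quoted in the text (it is not a column of the table).

```python
# GPU timings in seconds, copied from the last row of Table 2.
gpu_stage_times = {
    "patch correlations": 0.084,
    "max-pooling + subsampling": 0.012,
    "aggregation": 0.017,
    "non-linear rectification": 0.013,
    "backtracking": 0.053,
}
gpu_total = 0.213
descriptor_time = 0.024  # quoted in the text, not listed in the table


def share(stage_time, total):
    """Percentage of the total run-time spent in one stage."""
    return 100.0 * stage_time / total


for name, seconds in gpu_stage_times.items():
    print(f"{name:28s} {share(seconds, gpu_total):5.1f}%")
print(f"{'patch descriptors':28s} {share(descriptor_time, gpu_total):5.1f}%")
```

Under these numbers, patch correlations and backtracking together account for roughly two thirds of the total, consistent with the bottleneck identified above.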


5.2.4 Comparison to the state of the art

We compare DM with several baselines and state-of- 
the-art matching algorithms, namely: 

— SIFT keypoints extracted with DoG detector (Lowe
2004) and matched with FLANN (Muja and Lowe
2009), referred to as SIFT-NN‡,

— dense HOG matching, followed by nearest-neighbor
matching with reciprocal verification as done in LDOF
(Brox and Malik 2011), referred to as HOG-NN†,

— Generalized PatchMatch (GPM) (Barnes et al 2010),
with default parameters, 32x32 patches and 20 iterations
(best settings in our experiments)‡,

— Kd-tree PatchMatch (KPM) (Sun 2012), an improved
version of PatchMatch based on better patch descriptors
and kd-trees optimized for correspondence
propagation†,

— Non-Rigid Dense Correspondences (NRDC) (HaCohen
et al 2011), an improved version of GPM based
on a multiscale iterative expansion/contraction strategy®,

— SIFT-flow (Liu et al 2011), a dense matching algorithm
based on an energy minimization where pixels
are represented as SIFT features and a smoothness
term is incorporated to explicitly preserve spatial
discontinuities‡,

— Scale-less SIFT (SLS) (Hassner et al 2012), an improvement
of SIFT-flow to handle scale changes (multiple
sized SIFTs are extracted and combined to
form a scale-invariant pixel representation)‡,

— DaisyFilterFlow (DaisyFF) (Yang et al 2014), a dense
matching approach that combines filter-based efficient
flow inference and the Patch-Match fast search
algorithm to match pixels described using the DAISY
representation (Tola et al 2010)‡,

— Deformable Pyramid Matching (DSP) (Kim et al
2013), a dense matching approach based on a coarse-to-fine
(top-down) strategy where inference is performed
with (inexact) loopy belief propagation‡.


† We implemented this method ourselves.
‡ We used the online code.
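
Nearest-neighbor matching with reciprocal verification, as used by the HOG-NN baseline above, keeps a match only if each descriptor is the other's nearest neighbor. The following is a minimal sketch of that generic mutual-nearest-neighbor test on toy descriptors. The function names are hypothetical, brute-force search stands in for FLANN or any faster index, and this is not claimed to be the exact form of the verification in LDOF or in Eq. (16).

```python
def nearest(query, candidates):
    """Index of the candidate descriptor closest to `query` (squared L2)."""
    best_index, best_dist = -1, float("inf")
    for j, cand in enumerate(candidates):
        dist = sum((a - b) ** 2 for a, b in zip(query, cand))
        if dist < best_dist:
            best_index, best_dist = j, dist
    return best_index


def reciprocal_matches(desc1, desc2):
    """Keep (i, j) only if j is the nearest neighbor of i in desc2
    AND i is the nearest neighbor of j in desc1."""
    forward = [nearest(d, desc2) for d in desc1]
    backward = [nearest(d, desc1) for d in desc2]
    return [(i, j) for i, j in enumerate(forward) if backward[j] == i]


# Toy descriptors: the last entry of desc1 competes with the first one
# for the same target and is rejected by the reciprocal test.
desc1 = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.3, 0.0]]
desc2 = [[0.1, 0.0], [5.0, 5.1], [0.9, 0.1]]
print(reciprocal_matches(desc1, desc2))
```

The test removes one-directional matches, which is why methods filtered this way output sparse rather than fully dense correspondences.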

SIFT-NN, HOG-NN and DM output sparse matches,
whereas the other methods output fully dense correspondence
fields. SIFT keypoints, GPM, NRDC and
DaisyFF are scale and rotation invariant, whereas HOG-NN,
KPM, SIFT-flow, SLS and DSP are not. We, therefore,
do not report results for these latter methods on
the Mikolajczyk dataset which includes image rotations
and scale changes.

Statistics about each method (average number of 
matches per image and their coverage) are reported 
in Table 3. Coverage is computed as the proportion of 
points on a regular grid with 10 pixel spacing for which 
there exists a correspondence (in the raw output of the 
considered method) within a 10 pixel neighborhood. 
Thus, it measures how well matches “cover” the image. 
Table 3 shows that DeepMatching outputs 2 to 7 times
more matches than SIFT-NN and a comparable number
to HOG-NN. Yet, the coverage for DM matches
is much higher than for HOG-NN and SIFT-NN. This
shows that DM matches are well distributed over the 
entire image, which is not the case for HOG-NN and 
SIFT-NN, as they have difficulties estimating matches 
in regions with weak or repetitive textures. 
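
The coverage measure defined above is simple to compute. The sketch below is one possible reading of the definition, under assumptions that the text does not fix: the grid is assumed to start at pixel (0, 0), the "10 pixel neighborhood" is taken to be a Euclidean disc, and the function name `coverage` is hypothetical.

```python
def coverage(match_positions, width, height, spacing=10, radius=10):
    """Proportion of regular grid points that have at least one match
    within `radius` pixels. `match_positions` holds (x, y) coordinates
    of the matches in the first image."""
    points = list(match_positions)
    grid = [(gx, gy)
            for gy in range(0, height, spacing)
            for gx in range(0, width, spacing)]
    covered = 0
    for gx, gy in grid:
        # Assumption: neighborhood = disc of the given radius.
        if any((x - gx) ** 2 + (y - gy) ** 2 <= radius ** 2 for x, y in points):
            covered += 1
    return covered / len(grid)


# A single match at (5, 5) in a 40x20 image covers the 4 grid points
# around it out of the 8 grid points in total.
print(coverage([(5, 5)], width=40, height=20))
```

With this definition, a fully dense correspondence field trivially reaches a coverage of 1, which matches the entries reported for the dense methods in Table 3.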

Quantitative results are listed in Table 4, and qualitative
results in Figures 15, 16 and 17. Overall, DM
significantly outperforms all other methods, even when
reduced settings are used (e.g. for image resolution R =
1/2 and D = 1024 prototypes). As expected, SIFT-

® We report results from the original paper. 


             Mikolajczyk        MPI-Sintel (final)   Kitti
Method       #      coverage    #      coverage      #      coverage
SIFT-NN      2084   0.59        836    0.25          1299   0.38
HOG-NN       -      -           4576   0.39          4293   0.34
KPM          -      -           446K   1             462K   1
GPM          545K   1           446K   1             462K   1
NRDC         545K   1           446K   1             462K   1
SIFT-flow    -      -           446K   1             462K   1
SLS          -      -           446K   1             462K   1
DaisyFF      545K   1           446K   1             462K   1
DSP          -      -           446K   1             462K   1
DM (ours)    3120   0.81        5920   0.96          5357   0.88

Table 3 Statistics of the different matching methods. The #
column refers to the average number of matches per image, and
the coverage to the proportion of points on a regular grid with
10 pixel spacing that have a match within a 10px neighborhood.


method      R     D      accuracy@10   memory usage   matching time

Mikolajczyk dataset
SIFT-NN     -     -      0.674         0.2 GB         1.4 sec
GPM         -     -      0.303         0.1 GB         2.4 min
NRDC        -     -      0.692         0.1 GB         2.5 min
DaisyFF     -     -      0.410         6.1 GB         16 min
DM          1/4   ∞      0.657         0.9 GB         38 sec
DM          1/2   1024   0.820         1.5 GB         4.5 min
DM          1/2   ∞      0.878         4.4 GB         6.3 min

MPI-Sintel dataset (final)
SIFT-NN     -     -      0.684         0.2 GB         2.7 sec
HOG-NN      -     -      0.712         3.4 GB         32 sec
KPM         -     -      0.738         0.3 GB         7.3 sec
GPM         -     -      0.812         0.1 GB         1.1 min
SIFT-flow   -     -      0.890         1.0 GB         29 sec
SLS         -     -      0.824         4.3 GB         16 min
DaisyFF     -     -      0.873         6.8 GB         12 min
DSP         -     -      0.853         0.8 GB         39 sec
DM          1/4   ∞      0.835         0.3 GB         1.6 sec
DM          1/2   1024   0.869         1.8 GB         10 sec
DM          1/2   ∞      0.892         4.6 GB         16 sec

Kitti dataset
SIFT-NN     -     -      0.489         0.2 GB         1.7 sec
HOG-NN      -     -      0.537         2.9 GB         24 sec
KPM         -     -      0.536         0.3 GB         17 sec
GPM         -     -      0.661         0.1 GB         2.7 min
SIFT-flow   -     -      0.673         1.0 GB         25 sec
SLS         -     -      0.748         4.4 GB         17 min
DaisyFF     -     -      0.796         7.0 GB         11 min
DSP         -     -      0.580         0.8 GB         2.9 min
DM          1/4   ∞      0.800         0.3 GB         1.6 sec
DM          1/2   1024   0.812         1.7 GB         10 sec
DM          1/2   ∞      0.856         4.7 GB         14 sec

Table 4 Matching performance, run-time and memory usage
for state-of-the-art methods and DeepMatching (DM). For the
proposed method, R and D denote the input image resolution
and the dictionary size (∞ stands for no compression). Run-times
are computed on 1 core @ 3.6 GHz.


NN performs rather well in the presence of global image
transformations (Mikolajczyk dataset), but yields the
worst result for the case of more complex motions (flow
datasets). Figures 16 and 17 illustrate the reason: SIFT’s
large patches are too coarse to follow motion boundaries
precisely. The same issue also holds for HOG-NN.
Methods predicting dense correspondence fields return
a more precise estimate, yet most of them (KPM, GPM,
SIFT-flow, DSP) are not robust to repetitive textures
in the Kitti dataset (Figure 17) as they rely on weakly
discriminative small patches. Despite this limitation,
SIFT-flow and DSP are still able to perform well on
MPI-Sintel as this dataset contains few scale changes.
Other dense methods, NRDC, SLS and DaisyFF, can
handle patches of different sizes and thus perform better
on Kitti. But in turn this is at the cost of reduced
performance on the MPI-Sintel or Mikolajczyk datasets
(qualitative results are in Figure 15). In conclusion, DM
outperforms all other methods on the 3 datasets, including
DSP which also relies on a hierarchical matching.

In terms of computing resources, DeepMatching with
full settings (R = 1/2, D = ∞) is one of the most
costly methods (only SLS and DaisyFF require the same
order of memory and longer run-time). The scale and
rotation invariant version of DM, used for the Mikolajczyk
dataset, is slow compared to most other approaches,
due to its sequential processing (i.e. treating
each combination of rotation and scaling sequentially),
yet yields near perfect results. However, running DM
with reduced settings is very competitive with the other
approaches. On MPI-Sintel and Kitti, for instance, DM
with a quarter resolution has a run-time comparable to
the fastest method, SIFT-NN, with a reasonable memory
usage, while still outperforming nearly all methods
in terms of the accuracy@10 measure.


5.3 Optical Flow Experiments 

We now present experimental results for optical flow
estimation. Optical flow is predicted using the variational
framework presented in Section 4.3 that takes as
input a set of matches. In the following, we evaluate the
impact of DeepMatching against other matching methods,
and compare to the state of the art.

5.3.1 Optimization of the parameters 

We optimize the parameters of DeepFlow on a subset
of the MPI-Sintel training set (20%), called “small”
set, and report results on the remaining image pairs
(80%, called “validation set”) and on the training sets
of Kitti and Middlebury. Ground-truth optical flows for
the three test sets are not publicly available, in order
to prevent parameter tuning on the test set.

We first optimize the different flow parameters (β,
γ, δ, σ and b) by employing a gradient descent strategy
with multiple initializations followed by a local grid
search. For the data term, we find an optimum at δ = 0,


Method      R     D      MPI-Sintel   Kitti     Middlebury
No Match    -     -      5.863        8.791     0.274
SIFT-NN     -     -      5.733        7.753     0.280
HOG-NN      -     -      5.458        8.071     0.273
KPM         -     -      5.560        15.289    0.275
GPM         -     -      5.561        17.491    0.286
SIFT-flow   -     -      5.243        12.778    0.283
SLS         -     -      5.307        10.366    0.288
DaisyFF     -     -      5.145        10.334    0.289
DSP         -     -      5.493        15.728    0.283
DM          1/2   1024   4.350        7.899     0.320
DM          1/2   ∞      4.098        4.407     0.328

Table 5 Comparison of average endpoint error on different
datasets when changing the input matches in the flow computation.


which is equivalent to removing the color constancy assumption.
This can be explained by the fact that the
“final” version contains atmospheric effects, reflections,
blurs, etc. The remaining parameters are optimal at
β = 300, γ = 0.8, σ = 0.5, b = 0.6. These parameters
are used in the remainder of the experiments for DeepFlow,
i.e. using matches obtained with DeepMatching,
except when reporting results on Kitti and Middlebury
test sets in Section 5.3.3. In this case the parameters
are optimized on their respective training set.
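
The search strategy described above (gradient descent from multiple initializations, followed by a local grid search) can be sketched as follows. This is a hedged illustration only: the step size, number of restarts, grid width and the finite-difference gradient are assumptions, all function names are hypothetical, and `toy_error` is a stand-in for the real objective, which would be the average endpoint error of the flow on the "small" training subset.

```python
import itertools
import random


def numeric_gradient(f, x, eps=1e-4):
    """Central finite-difference gradient (assumed; the objective is a black box)."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def gradient_descent(f, x0, step=0.1, iters=200):
    x = list(x0)
    for _ in range(iters):
        g = numeric_gradient(f, x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def local_grid_search(f, x, half_width=0.1, points=5):
    """Evaluate f on a small regular grid centred on x and keep the best point."""
    offsets = [half_width * (2 * k / (points - 1) - 1) for k in range(points)]
    grid = (tuple(xi + o for xi, o in zip(x, offs))
            for offs in itertools.product(offsets, repeat=len(x)))
    return list(min(grid, key=f))


def optimize(f, bounds, n_starts=5, seed=0):
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_starts):
        x0 = [rng.uniform(lo, hi) for lo, hi in bounds]
        candidates.append(gradient_descent(f, x0))
    best = min(candidates, key=f)
    return local_grid_search(f, best)


def toy_error(params):
    """Toy stand-in for the flow error, minimal at (0.8, 0.5)."""
    return (params[0] - 0.8) ** 2 + (params[1] - 0.5) ** 2


best_params = optimize(toy_error, bounds=[(0.0, 2.0), (0.0, 2.0)])
print(best_params)
```

The grid includes the zero offset, so the final grid search can never return a point worse than the best gradient-descent result. Note that the grid grows exponentially with the number of parameters (5^5 = 3125 evaluations for five parameters with this setting).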

5.3.2 Impact of the matches on the flow 

We examine the impact of different matching methods
on the flow, i.e., different matches are used in DeepFlow,
see Section 4.3. For all matching approaches evaluated
in the previous section, we use their output as
matching term in Eq. (17). Because these approaches
may output matches with statistics different from DM,
we separately optimize the flow parameters for each
matching approach on the small training set of
MPI-Sintel^.

Table 5 shows the endpoint error, averaged over all
pixels. Clearly, a sufficiently dense and accurate matching
like DM considerably improves the flow
estimation on datasets with large displacements
(MPI-Sintel, Kitti). In contrast, none of the methods presented
has a tangible effect on the Middlebury dataset,
where the displacements are small.
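
For reference, the endpoint error averaged over all pixels can be computed as in the short sketch below. The function name and the optional `valid` mask (useful when ground truth is sparse, as in Kitti) are illustrative assumptions; flows are represented as plain lists of (u, v) vectors, one per pixel.

```python
import math


def average_endpoint_error(flow, gt_flow, valid=None):
    """Mean Euclidean distance between estimated and ground-truth
    flow vectors, restricted to pixels flagged as valid."""
    if valid is None:
        valid = [True] * len(gt_flow)
    errors = [math.hypot(u - gu, v - gv)
              for (u, v), (gu, gv), ok in zip(flow, gt_flow, valid) if ok]
    return sum(errors) / len(errors)


# Two pixels: one is off by (3, 4), i.e. 5 px, the other is exact.
print(average_endpoint_error([(3, 4), (0, 0)], [(0, 0), (0, 0)]))
```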

The relatively small gains achieved by SIFT-NN and
HOG-NN on MPI-Sintel and Kitti are due to the fact
that a lot of regions with large displacements are not
covered by any matches, such as the sky or the blurred
character in the first and second column of Figure 18.
Hence, SIFT-NN and HOG-NN have only a limited impact
on the variational approach. On the other hand,
the gains are also small (or even negative) for the dense

^ Note that this systematically improves the endpoint error
compared to using the raw dense correspondence fields as flow.

Fig. 15 Comparison of matching results of different methods on the Mikolajczyk dataset. Each row shows pixels with correct
correspondences for different methods (from top to bottom: ground-truth, SIFT-NN, GPM, NRDC and DM). For each scene, we select
two images to match and fade out regions which are unmatched, i.e. those for which the matching error is above 15px or which cannot be
matched. DeepMatching outperforms the other methods, especially on difficult cases like graf and wall.


methods despite the fact that they output significantly
more correspondences. We observe for these methods
that the weight β of the matching term tends to be
small after optimizing the parameters, thus indicating
that the matches are found unreliable and noisy during
training. The cause is clearly visible in Figure 17,
where large portions containing repetitive textures (e.g.
road, trees) are incorrectly matched. The poor quality
of these matches even leads to a significant drop in performance
on the Kitti dataset.

In contrast, DeepMatching generates accurate matches
that cover the image well and boost the optical
flow accuracy in case of large displacements. Namely, we
observe a relative improvement of 30% on MPI-Sintel
and of 50% on Kitti. It is interesting to observe that
DM is able to effectively prune false matches arising
in occluded areas (black areas in Figures 16 and 17).
This is due to the reciprocal verification filtering incorporated
in DM (Eq. (16)). When using the approximation
with 1024 prototypes, however, a significant drop
is observed on the Kitti dataset, while the performance
remains good on MPI-Sintel. This indicates that approximating
DeepMatching can result in a significant
loss of robustness when matching repetitive textures,
which are more frequent in Kitti than in MPI-Sintel.
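
The relative improvements quoted above can be reproduced from Table 5. The small calculation below assumes that the baseline is the "No Match" row (the variational method without any matching term); the helper name `relative_improvement` and the dictionary layout are illustrative.

```python
def relative_improvement(baseline_epe, epe):
    """Fraction by which the endpoint error drops relative to the baseline."""
    return (baseline_epe - epe) / baseline_epe


# (No Match, DM with D = infinity, DM with D = 1024), values from Table 5.
table5 = {
    "MPI-Sintel": (5.863, 4.098, 4.350),
    "Kitti": (8.791, 4.407, 7.899),
}

for dataset, (no_match, dm_full, dm_1024) in table5.items():
    full = 100 * relative_improvement(no_match, dm_full)
    approx = 100 * relative_improvement(no_match, dm_1024)
    print(f"{dataset}: full DM {full:.1f}%, 1024 prototypes {approx:.1f}%")
```

With 1024 prototypes the gain on Kitti falls to about 10%, whereas it stays near 26% on MPI-Sintel, which illustrates the loss of robustness on repetitive textures discussed above.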


5.3.3 Comparison to the state of the art 

In this section, we compare DeepFlow to the state of the
art on the test sets of MPI-Sintel, Kitti and Middlebury
datasets. For these datasets, the results are submitted
to a dedicated server which performs the evaluation.
Prior to submitting our results for Kitti and Middlebury
test sets, we have optimized the parameters on
the respective training set.

Results on MPI-Sintel. Table 6 compares our method
to state-of-the-art algorithms on the MPI-Sintel test
set. A comparison with the preliminary version of DeepFlow
(Weinzaepfel et al 2013), referred to as DeepFlow*,
is also provided. In this early version, we used
a constant smoothness weight instead of the local one
used here (see Section 4.3) and used DM* as input matches.
We can see that DeepFlow is among the best performing
methods on MPI-Sintel, particularly for large displacements.
This is due to the use of a reliable matching
term in the variational approach, and this property
is shared by all top performing approaches, e.g. (Revaud
et al 2015; Leordeanu et al 2013). Furthermore,
it is interesting to note that among the top performers
on MPI-Sintel, 3 methods out of 6 actually employ
DeepMatching. In particular, the top-3 method
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Fig. 16 Comparison of different matching methods on three challenging pairs with non-rigid deformations from MPI-Sintel. Each pair 
of columns shows motion maps (left column) and the corresponding error maps (right column). The top row presents the ground-truth 
(GT) as well as one image. For non-dense methods, pixel displacements have been inferred from matching patches. Areas without 
correspondences are in black. 


Method                                         EPE     EPE-occ   s0-10   s10-40   s40+     Time
FlowFields (Bailer et al 2015)                 5.810   31.799    1.157   3.739    33.890   23s
DiscreteFlow (Menze et al 2015)                5.810   31.799    1.157   3.739    33.890   180s
EpicFlow (Revaud et al 2015)                   6.285   32.564    1.135   3.727    38.021   16.4s
TF+OFM (Kennedy and Taylor 2015)               6.727   33.929    1.512   3.765    39.761   ~400s
DeepFlow                                       6.928   38.166    1.182   3.859    42.854   25s
SparseFlowFused (Timofte and Van Gool 2015)    7.189   3.286     1.275   3.963    44.319   20
DeepFlow* (Weinzaepfel et al 2013)             7.212   38.781    1.284   4.107    44.118   19s
S2D-Matching (Leordeanu et al 2013)            7.872   40.093    1.172   4.695    48.782   ~2000s
LocalLayering (Sun et al 2014a)                8.043   40.879    1.186   4.990    49.426
Classic+NL-P (Sun et al 2014b)                 8.291   40.925    1.208   5.090    51.162   ~800s
MDP-Flow2 (Xu et al 2012)                      8.445   43.430    1.420   5.449    50.507   709s
NLTGV-SC (Ranftl et al 2014)                   8.746   42.242    1.587   4.780    53.860
LDOF (Brox and Malik 2011)                     9.116   42.344    1.485   4.839    57.296   30s

Table 6 Results on MPI-Sintel test set (final version). EPE-occ
is the EPE on occluded areas. s0-10 is the EPE for pixels
with motions between 0 and 10 px and similarly for s10-40 and
s40+. DeepFlow* denotes the preliminary version of DeepFlow
published in Weinzaepfel et al (2013).


EpicFlow (Revaud et al 2015) relies on the output of
DeepMatching to produce a piece-wise affine flow, and
SparseFlowFused (Timofte and Van Gool 2015) combines
matches obtained with DeepMatching and another
algorithm.

We refer to the webpage of the MPI-Sintel dataset 
for complete results including the “clean” version. 


Timings. As mentioned before, DeepMatching at half
the resolution takes 15 seconds to compute on CPU and
0.2 seconds on GPU. The variational part requires 10 additional
seconds on CPU. Note that by implementing it
on GPU, we could obtain a significant speed-up as well.
DeepFlow consequently takes 25 seconds in total on a
single CPU core @ 3.6 GHz or 10.2s with GPU+CPU.
This is of the same order of magnitude as the fastest
among the best competitors, EpicFlow (Revaud et al
2015).

Results on Kitti. Table 7 summarizes the main results
on the Kitti benchmark (see official website for complete
results), when optimizing the parameters on the
Kitti training set. EPE-Noc is the EPE computed only
in non-occluded areas. “Out 3” corresponds to the proportion
of incorrect pixel correspondences for an error
threshold of 3 pixels, i.e. it corresponds to 1 −
accuracy@3, and likewise for “Out-Noc 3” for non-occluded
areas. In terms of EPE-noc, DeepFlow is on par with
the best approaches, but performs somewhat worse in
the occluded areas. This is due to a specificity of the
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Fig. 17 Comparison of different matching methods on three challenging pairs from Kitti. Each pair of columns shows motion maps 
(left column) and the corresponding error maps (right column). The top row presents the ground-truth (GT) as well as one image. 
For non-dense methods, pixel displacements have been inferred from matching patches. Areas without correspondences are in black. 
To improve visualization, the sparse Kitti ground-truth is made dense using bilateral filtering. 


Kitti dataset, in which motion is mostly homographic
(especially on the image borders, where most surfaces
like roads and walls are planar). In such cases, flow is
better predicted using an affine motion prior, which locally
approximates homographies well (a constant motion
prior is used in DeepFlow). As a matter of fact, all
top performing methods in terms of total EPE output
piece-wise affine optical flow, either due to affine regularizers
(BTF-ILLUM (Demetz et al 2014), NLTGV-SC
(Ranftl et al 2014), TGV2ADCSIFT (Braux-Zin et al
2013)) or due to local affine estimators (EpicFlow (Revaud
et al 2015)).
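
The relation "Out 3 = 1 − accuracy@3" used above can be made concrete with a few lines of code. The sketch below assumes that a pixel counts as correct when its endpoint error is at most the threshold (the caption of Table 7 defines an outlier as an error above 3 pixels); the function names are hypothetical.

```python
import math


def accuracy_at(pred, gt, threshold):
    """Fraction of pixels whose endpoint error is at most `threshold` px."""
    errors = [math.hypot(px - gx, py - gy)
              for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(e <= threshold for e in errors) / len(errors)


def outlier_rate(pred, gt, threshold=3.0):
    """Proportion of pixels with an error above the threshold ("Out 3")."""
    return 1.0 - accuracy_at(pred, gt, threshold)


# Four pixels with endpoint errors of 1, 5, 5 and 0 px.
pred = [(1, 0), (0, 5), (3, 4), (0, 0)]
gt = [(0, 0)] * 4
print(accuracy_at(pred, gt, 3), outlier_rate(pred, gt, 3))
```

Restricting both lists to non-occluded pixels before calling `outlier_rate` would give the "Out-Noc 3" variant.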

Note that the learned parameters on Kitti and MPI- 
Sintel are close. In particular, running the experiments 
with the same parameters as MPI-Sintel decreases EPE- 
Noc by only 0.1 pixels on the training set. This shows 
that our method does not suffer from overfitting. 

Results on Middlebury. We optimize the parameters on
the Middlebury training set by minimizing the average
angular error with the same strategy as for MPI-Sintel.
We find near-zero weights for the matching term due to
the absence of large displacements. DeepFlow obtained


Method                                 EPE-noc   EPE   Out-Noc 3   Out 3     Time
DiscreteFlow (Menze et al 2015)        1.3       3.6   5.77%       16.63%    180s
FlowFields (Bailer et al 2015)         1.4       3.5   6.23%       14.01%    23s
DeepFlow                               1.4       5.3   6.61%       17.35%    22s
BTF-ILLUM (Demetz et al 2014)          1.5       2.8   6.52%       11.03%    80s
EpicFlow (Revaud et al 2015)           1.5       3.8   7.88%       17.08%    16s
TGV2ADCSIFT (Braux-Zin et al 2013)     1.5       4.5   6.20%       15.15%    12s*
DeepFlow* (Weinzaepfel et al 2013)     1.5       5.8   7.22%       17.79%    17s
NLTGV-SC (Ranftl et al 2014)           1.6       3.8   5.93%       11.96%    16s*
Data-Flow (Vogel et al 2013b)          1.9       5.5   7.11%       14.57%    180s
TF+OFM (Kennedy and Taylor 2015)       2.0       5.0   10.22%      18.46%    350s

Table 7 Results on Kitti test set. EPE-noc is the EPE over
non-occluded areas. Out-Noc 3 (resp. Out 3) refers to the percentage
of pixels where flow estimation has an error above 3 pixels
in non-occluded areas (resp. all pixels). DeepFlow* denotes the
preliminary version of DeepFlow published in Weinzaepfel et al
(2013). A * after a run-time denotes the usage of a GPU.


an average endpoint error of 0.4 on the test set, which is
competitive with the state of the art.


6 Conclusion 

We have introduced a dense matching algorithm, termed
DeepMatching. The proposed algorithm gracefully handles
complex non-rigid object deformations and repetitive
textured regions. DeepMatching yields state-of-the-art
performance for image matching, on the Mikolajczyk
(Mikolajczyk et al 2005), the MPI-Sintel (Butler
et al 2012) and the Kitti (Geiger et al 2013) datasets.
Integrated in a variational energy minimization approach
(Brox and Malik 2011), the resulting approach for optical
flow estimation, termed DeepFlow, shows competitive
performance on optical flow benchmarks.

Future work includes incorporating a weighting of
the patches in Eq. (2) instead of weighting all patches
equally to take into account that different parts of a
large patch may belong to different objects. This could
improve the performance of DeepMatching for thin objects,
such as human limbs.

Fig. 18 Each column shows from top to bottom: two consecutive images, the ground-truth optical flow, the DeepMatching, our flow
prediction (DeepFlow), and two state-of-the-art methods, LDOF (Brox and Malik 2011) and MDP-Flow2 (Xu et al 2012).
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